How this grassroots effort could make AI voices more diverse

Ryakitimbo has collected voice data in Kiswahili in Tanzania, Kenya, and the Democratic Republic of Congo. She tells me she wanted to collect voices from a socioeconomically diverse set of Kiswahili speakers and has reached out to women, young and old, living in rural areas, who might not always be literate or even have access to devices. 

This kind of data collection is challenging. The importance of collecting AI voice data can feel abstract to many people, especially if they aren’t familiar with the technologies. Ryakitimbo and her volunteers would approach women in settings where they already felt safe, such as presentations on menstrual hygiene, and explain how the technology could, for example, help disseminate information about menstruation. For women who could not read, the team read out sentences for them to repeat aloud for the recording. 

The Common Voice project is grounded in the belief that languages form a crucial part of identity. “We think it’s not just about language, but about transmitting culture and heritage and treasuring people’s particular cultural context,” says Lewis-Jong. “There are all kinds of idioms and cultural catchphrases that just don’t translate,” she adds. 

Common Voice is the only audio data set where English doesn’t dominate, says Willie Agnew, a researcher at Carnegie Mellon University who has studied audio data sets. “I’m very impressed with how well they’ve done that and how well they’ve made this data set that is actually pretty diverse,” Agnew says. “It feels like they’re way far ahead of almost all the other projects we looked at.” 

I spent some time verifying the recordings of other Finnish speakers on the Common Voice platform. As their voices echoed in my study, I felt surprisingly touched. We had all gathered around the same cause: making AI data more inclusive, and making sure our culture and language were properly represented in the next generation of AI tools. 

But I had some big questions about what would happen to my voice if I donated it. Once it was in the data set, I would have no control over how it might be used afterwards. The tech sector isn’t exactly known for giving people proper credit, and the data is available for anyone to use. 

“As much as we want it to benefit the local communities, there’s a possibility that also Big Tech could make use of the same data and build something that then comes out as the commercial product,” says Ryakitimbo. Though Mozilla does not share who has downloaded Common Voice, Lewis-Jong tells me Meta and Nvidia have said that they have used it.

Open access to this hard-won and rare language data is not something all minority groups want, says Harry H. Jiang, a researcher at Carnegie Mellon University who was part of the team that conducted the audit research. Indigenous groups, for example, have raised concerns.