Google, in partnership with a consortium of leading African research institutions, has today announced the launch of WAXAL, a large-scale, openly accessible speech dataset designed to accelerate research and support the development of inclusive artificial intelligence technologies across the continent.
The dataset addresses a long-standing gap in AI development by providing foundational speech data for 21 Sub-Saharan African languages, including Hausa, Luganda, Yoruba, Acholi, and Swahili. Collectively, these languages are spoken by more than 100 million people in more than 25 countries.
Addressing Africa’s voice technology gap
Despite rapid global adoption of voice-enabled technologies, Africa has remained largely excluded due to the scarcity of high-quality speech data for its more than 2,000 languages. This data gap has limited the development of speech recognition systems, virtual assistants, and voice-based digital services for millions of African users.
WAXAL was developed to directly address this challenge. Built over a three-year period with funding from Google, the dataset contains over 1,250 hours of transcribed natural speech, alongside more than 20 hours of studio-quality recordings designed to support the creation of high-fidelity synthetic voices.
By making this data openly available, the initiative lays the groundwork for African developers, researchers, and startups to build voice technologies that understand local languages and accents.
Built by African institutions, for African communities
A defining feature of WAXAL is its community-led development model. Data collection was led by African academic and community organizations, including Makerere University (Uganda), the University of Ghana, and Digital Umuganda (Rwanda), working closely with Google research teams.
Under this model, partner institutions retain full ownership of the data, establishing a new framework for equitable and partnership-driven AI development on the continent.
“This dataset provides the critical foundation for students, researchers, and entrepreneurs to build technology on their own terms, in their own languages,” said Aisha Walcott-Bryant, Head of Google Research Africa.
“The ultimate impact of WAXAL is the empowerment of people in Africa. We look forward to seeing African innovators use this data to create everything from new educational tools to voice-enabled services that create tangible economic opportunities across the continent.”
Languages covered by the WAXAL dataset
The dataset includes speech data for the following languages:
Acholi, Akan, Dagaare, Dagbani, Dholuo, Ewe, Fante, Fulani (Fula), Hausa, Igbo, Ikposo (Kposo), Kikuyu, Lingala, Luganda, Malagasy, Masaaba, Nyankole, Rukiga, Shona, Soga (Lusoga), Swahili, and Yoruba.
These languages span East, West, Central, and Southern Africa, reflecting broad geographic and linguistic representation.
Ethics, privacy, and open access
Ethical considerations and participant privacy were central to the project’s design. All contributors provided informed consent, and personally identifiable information was manually removed from the dataset. WAXAL is released under the Creative Commons Attribution 4.0 (CC BY 4.0) license, which permits broad reuse provided the source is credited.
The full dataset is available starting today on Hugging Face, enabling immediate access for researchers, developers, and institutions worldwide.
Unlocking inclusive digital growth
By lowering barriers to speech and language research, WAXAL is expected to accelerate the development of voice-enabled education tools, accessibility solutions, public service applications, and locally relevant digital products.
As Africa’s digital economy grows, initiatives like WAXAL position local languages not as obstacles to innovation, but as critical assets for inclusive technological development.