Field Stories

Opinion Piece: Donald Clark on Why AI is Africa’s Friend – Not Its Enemy

AI is Africa’s friend, not its enemy!

Although I speak English, I’m not English but Scottish, from a culture that is open, talkative and built on humour. That doesn’t make me an expert on the vastly different cultures, social norms and linguistic nuances of Africa, but I work in AI, and that helps!

Yes, colonial erasure, language marginalisation and economic inequality are real, but it would be easy to use them as a reason for doing nothing and rejecting the technology of the age – that would be a mistake.

Partisan biases, particularly an over-reliance on English, Chinese or Western perspectives, could marginalise non-English cultures and languages in Africa and lead to even further inequality. A common charge against the large foundation models and services in AI is that they skew towards English and other major languages, at the expense of minority languages.

Like the tourist in Ireland who stopped an Irishman to ask, “How do I get to Dublin?” and got the reply, “Well, I wouldn’t have started from here!”, let’s keep our sense of humour but also take the joke seriously. You have to start somewhere when building AI models, and you can’t start from everywhere at once. If these companies had said they wouldn’t launch ChatGPT and its peers until all 7,100 of the world’s languages, including the 2,000-plus African languages, were covered, it would have taken decades, as these things take time. Even then the task is almost impossible. Let me explain why.

You can’t make a clay pot without clay, and a huge number of languages have no substantial written repositories (the clay in my analogy). Of the roughly 7,100 known languages globally, around 3,500 have no formal written system and exist purely in oral form. In Africa, of approximately 2,150 languages, over 1,500 are oral-only, with no writing system at all. Others have writing but little data online, and what text exists often carries colonial influences, such as Bible translations and administrative records. Let us not underplay the colonial legacy, but if anything, AI is working to halt and solve these problems, not exacerbate them.
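To put those counts in proportion, here is a quick back-of-the-envelope calculation – a sketch using only the approximate figures quoted above, not authoritative data:

```python
# Share of oral-only languages, using the approximate counts cited above.
world_languages = 7100   # known languages worldwide (approx.)
world_oral_only = 3500   # with no formal written system (approx.)
africa_languages = 2150  # African languages (approx.)
africa_oral_only = 1500  # oral-only African languages (approx.)

world_share = world_oral_only / world_languages    # about 49%
africa_share = africa_oral_only / africa_languages # about 70%

print(f"Oral-only worldwide: {world_share:.0%}")
print(f"Oral-only in Africa: {africa_share:.0%}")
```

In other words, roughly half of the world’s languages – and about seven in ten African languages – offer no written ‘clay’ at all to start from.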

English, French, Arabic, Swahili, Amharic and Somali are covered well by AI, and together account for over 560 million speakers in Africa. That is a reasonable start in just over two years. Other languages with some capability include: Hausa, Igbo, Yoruba, Zulu, Xhosa and Wolof. Sure, some of these are heavily skewed by subject matter, uneven in translation and varying in performance, but Rome, or Dar es Salaam, wasn’t built in a day, or even two years, and with each passing week we see improvements in the presence, breadth and depth of African languages in AI services.

We’re getting there

I have an avatar, Digital Don. It is pretty realistic, with my Scottish accent. I have also used it to speak Swahili (Kenyan and Tanzanian variants), Somali, Amharic and Zulu. That was in the first year of generative AI. Fast-forward to today, and the presence and capability of African languages in AI has risen markedly in only a few short years. Check this video out for an example!

It is true that only a few dozen African languages (primarily those with some web presence) appear in the training corpora for current models, with Bible translations often being the largest readily available corpora for ‘low-resource’ languages. These are followed by government, administrative and legal documents, such as published laws, constitutions and educational materials in indigenous languages. Lastly, there are news articles, forums and general web data that can be scraped via web crawlers. Given that this is all that was and is available, I believe they are trying with the little clay that they have.

And here is something fascinating: you don’t always need the clay, or data, to deal with a new language, although it helps. An interesting feature of multilingual large language models (MLLMs) is their ability to translate languages that are not even in the training set – a form of zero-shot transfer. Google, in particular, has been exploring this approach.

English as lingua franca

Duolingo has been offering courses in Swahili since 2017, and Zulu and Xhosa have now made their debut, but it is a raw fact that nearly half of all Duolingo learners are learning English. It is now the world’s lingua franca and an official language in countries such as Nigeria, South Africa, Kenya, Ghana, Uganda and Tanzania. The prominence of English reflects a colonial legacy – the reach of the British Empire and the adoption of English by the United States – and it is now widely used in international travel, trade and professional communication.

Today, its dominance is not so much a colonial force as a practical outcome, driven by the pragmatic need for standardised, universally accessible communication. For example, it is the mandated language of the International Civil Aviation Organisation (ICAO) for international pilots and air traffic controllers, and in the maritime industry the Standard Marine Communication Phrases (SMCP) require English for international communications. Science and academia rely on an English-dominant journal base, as do medicine and healthcare. Business and finance rely on English for international trade, corporate governance, legal documents and communication between multinational corporations, as do media, entertainment, tourism and hospitality. This may have historical and colonial roots – mainly the colonisation of North America, the adoption of English by the US and the size of the British Empire – but it is now a practical matter, built on the useful need for standardisation and global interoperability.

AI preserves dying languages

AI helps preserve dying languages by capturing what data does exist as it scrapes the web and other sources. Wikipedia is a good example: the amount of data a language has is well reflected by its presence on Wikipedia.

African languages are thriving on Wikipedia, with over 40 represented, so the continent’s written linguistic richness is now finding a digital home. Among the most active are Swahili, Yoruba and Afrikaans, each hosting thousands of articles and an engaged community of editors. There are also substantial presences in Amharic, Hausa, Igbo, Shona, Somali, Zulu, Kinyarwanda, Wolof, Lingala, Tigrinya, Bambara, Tswana and Sotho, all steadily expanding their footprints. You will also find content in Xitsonga, Kirundi, Fula, Kanuri and Kabyle. The story doesn’t stop there. Dozens of other African languages are finding a place on Wikipedia – some low-volume, others still under construction in the Incubator – including Ewe, Akan, Dagbani, Tshivenda, Gikuyu, Fon, Sango, Tumbuka, Tsonga, Ndebele, Chichewa, Luganda, Ndonga, Venda, Kongo and Sesotho sa Leboa. Even Maasai is in the mix, currently building momentum, along with Berber varieties such as Tarifit, all making their mark. As I said, it takes time, but look how much progress we are already making!
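As a practical aside, a language edition’s footprint can be checked programmatically. This is a minimal sketch using the public MediaWiki API – the endpoint and the `sw`/`yo`/`af` subdomain codes for Swahili, Yoruba and Afrikaans are real, but the helper names are my own:

```python
# Sketch: checking a language's Wikipedia footprint via the MediaWiki API.
import json
from urllib.request import urlopen

# Wikipedia subdomain codes for three of the most active African editions.
WIKI_EDITIONS = {"Swahili": "sw", "Yoruba": "yo", "Afrikaans": "af"}

def stats_url(lang_code):
    """Build the siteinfo/statistics query URL for one language edition."""
    return (f"https://{lang_code}.wikipedia.org/w/api.php"
            "?action=query&meta=siteinfo&siprop=statistics&format=json")

def article_count(lang_code):
    """Fetch the live article count for an edition (requires network)."""
    with urlopen(stats_url(lang_code)) as resp:
        data = json.load(resp)
    return data["query"]["statistics"]["articles"]

# e.g. article_count(WIKI_EDITIONS["Swahili"]) returns the number of
# articles currently on sw.wikipedia.org.
```

The live call needs a network connection; the URL builder alone shows how each edition is addressed by its language code.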

Translating into English and back into the source language has proved useful, as have other fine-tuning techniques, but nothing will beat the hard work being done by the Masakhane community, with hundreds of researchers from 30 African countries working to solve these problems. Mozilla’s Common Voice project is doing great work capturing the spoken word in recorded speech, an approach that holds great promise, as the recordings can be turned into text and represent the true use of the language in its oral, cultural context.

In Closing

This is not just about translating knowledge; it is also about preserving identity, culture and the power of local languages in the digital age. The work of Noelani Arista in Hawai’i and the proactive work on Icelandic with OpenAI are good examples.

Google’s ‘1,000 Languages Initiative’ is a long-term commitment to build speech and text AI models that can understand and generate 1,000+ languages. Google has invested heavily in open data collection and community partnerships, and supports crowdsourcing platforms to collect text and speech data in underrepresented languages.

Meta’s ‘No Language Left Behind’ (NLLB) work aims to close the language-equity gap in AI, with a focus on low-resource languages, particularly in Africa, South Asia and Southeast Asia. Meta has released NLLB-200, a translation model that supports 200 languages, including 55 African languages such as Wolof, Hausa, Bambara, Swahili, Xhosa, Zulu, Igbo, Yoruba and Lingala, to name a few. Meta also created FLORES-200, a multilingual evaluation dataset used to measure translation quality in very low-resource settings, and the models support Wikipedia article translation.
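Because the NLLB-200 checkpoints are openly released, anyone can try them. Below is a minimal sketch assuming the Hugging Face `transformers` library; the model name and the FLORES-200 language codes are real, but the helper function and code subset are my own illustration, and the actual call downloads a large checkpoint:

```python
# Illustrative subset of FLORES-200 language codes used by NLLB-200.
FLORES_CODES = {
    "Wolof": "wol_Latn",
    "Hausa": "hau_Latn",
    "Swahili": "swh_Latn",
    "Yoruba": "yor_Latn",
    "Zulu": "zul_Latn",
}

def translate(text, src="eng_Latn", tgt="yor_Latn"):
    """Translate text with the openly released NLLB-200 distilled model.

    Imports are deferred because calling this downloads a large checkpoint.
    """
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    name = "facebook/nllb-200-distilled-600M"
    tokenizer = AutoTokenizer.from_pretrained(name, src_lang=src)
    model = AutoModelForSeq2SeqLM.from_pretrained(name)
    inputs = tokenizer(text, return_tensors="pt")
    out = model.generate(
        **inputs,
        # Force the decoder to start generating in the target language.
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt),
    )
    return tokenizer.batch_decode(out, skip_special_tokens=True)[0]

# e.g. translate("Good morning", tgt=FLORES_CODES["Swahili"])
```

This is the open-weights route into the 55 African languages mentioned above: pick a source and target code, and the same model handles any pairing of the 200 supported languages.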

Importantly, all of this – the models, data and training code – is open source, so developers and researchers can use and build on it. Meta has also collaborated with Masakhane and with local linguists and researchers across Africa and Asia to expand this data.

Without AI, small minority languages are at even greater risk of disappearing. By collecting and archiving data from a wide range of online sources, AI can help preserve content that human archivists might overlook – especially as websites vanish and speaker populations decline. This is particularly urgent as other languages grow more dominant and aspirational, especially among younger generations. Because AI can document and transcribe oral speech into text, it can preserve the oral dimension of a language and turn it into useful data for training models. This extends into more complex aspects of speech, such as actual use, dialects and pronunciation. All of this helps not only to preserve endangered languages but also to provide resources for teaching, learning and assessment. Chatbots and educational tools can actually revitalise a language, literally spreading the word and its use among young speakers.

On productivity, it is important that people do not let the past define their future. Don’t be on the receiving end of tech; help shape it. We need local AI development ecosystems, and more African AI engineers and businesses to join the movement: Africans building AI for Africans, rooted in local needs and values. That future can be created by pulling together what data does exist and creating new data. This will take both time and effort, but it is vital if all countries and communities around the world are to seize the education, health and government opportunities that AI offers in terms of productivity and economic growth – for the many, not the few.

Going back to asking directions from our wise Irishman: how do I get to where I want to go? First work out where you are, plan your own route, take the first step, then the next and the next, and you’ll get there. The worst thing you can do is sit down, get nowhere, and blame the Irishman for not liking the English!
