AI ‘rescues’ a country with 700 languages
With more than 300 ethnic groups speaking more than 700 languages, Indonesia poses a new challenge for LLMs trained mainly on English, such as ChatGPT and Gemini.
Indonesia develops multilingual LLMs for languages with limited learning resources and at risk of extinction. Photo: Shutterstock.
Growing up in Banyuwangi regency, Indonesia, Antariksawan Jusuf, 58, spoke Bahasa Osing with family and friends. It wasn’t until he entered university in Bali, where he had to speak the national language, Bahasa Indonesia, that he realized Osing was in danger of extinction.
“The Osing language is threatened by the modernization process. Nowadays, many parents prefer to use Bahasa Indonesia when communicating with their children,” Antariksawan told Rest of World.
Osing is not the only language at risk of being wiped out. Indonesia has more than 700 regional languages and nearly 800 dialects across its vast territory. But according to researchers, more than 400 dialects are at risk of extinction by the end of the 21st century.
Therefore, the country’s government has turned to AI to help preserve languages and make them more accessible to the people.
The more indigenous the language, the more limited the learning resources
Popular large language models (LLMs), such as OpenAI’s GPT, Google’s Gemini, and Meta’s Llama, are trained mainly on English-language data.
Non-English-speaking countries are therefore trying to close the gap by building multilingual LLMs for low-data and endangered languages. These languages are widely used in everyday life, but little data in them is available on the Internet.
Current models are mainly trained in English. Photo: Shutterstock.
Speaking to Rest of World, Endang Aminudin Aziz – head of the language development agency at Indonesia’s Ministry of Education and Culture – said society is moving towards monolingualism due to globalization and modernization. “We are working to revive languages to prevent them from becoming extinct. I think AI and LLM technology will be useful,” he said.
LLM training requires large quantities of high-quality data, including books, media and academic documents, as well as open-source repositories such as GitHub.
According to Nuurrianti Jalli, assistant professor at Oklahoma State University, because regional languages lack learning resources, many people are concerned about whether these languages can accurately represent their cultures once digitized. “Where does the data come from? Who is behind it?” are the questions LLM developers ask.
This matters even more in a country like Indonesia, where information is widely censored and strictly controlled by the government. Jalli believes that diverse data sources are needed to ensure that the output of the LLM is comprehensive and unbiased.
“Involving a wide range of experts, including those who do not agree with the government, can help ensure that the context of the data is accurately presented. This is especially important when data can be manipulated to favor certain political groups,” Jalli told Rest of World.
How does Indonesia preserve languages using AI?
Earlier this year, Yellow.AI launched Komodo-7B, an LLM offered in Bahasa Indonesia and 11 other regional languages including Javanese, Balinese and Sundanese.
The model uses Indonesian textbooks, along with other data sources to ensure diversity, co-founder Rashid Khan told Rest of World. Khan said Komodo-7B is currently aimed at profitability rather than at preserving local languages and dialects, but that preservation remains a feasible goal in the near future.
According to him, this will require “a high level of information digitization” and can only happen when the community joins hands. “LLM training will become easier, once we reach a very high level of digitalization. That is when a particular language is represented in the form of books, articles, poems – all these things – that are available on the Internet,” Khan said.
But for now, the majority of training data is still in English. “If that continues to happen, some other languages will be left behind,” he said.
To date, other than Bahasa Indonesia, there are only two regional languages with digitized texts – Balinese and Makassarese. Antariksawan hopes that Osing will soon be on this list.
He assisted in publishing the Bahasa Osing – Bahasa Indonesia dictionary, and wrote a novel in two languages. He also founded the Osing language and culture preservation community, which publishes short stories, novels and videos about folk tales and children’s songs.
For Antariksawan, this is just the beginning. The Osing language has existed since the 13th century, and he is determined to preserve it for future generations. “My hope is that the younger generation can learn Osing from a young age and won’t have as much difficulty as I did finding materials in that language. I hope AI and LLM technology can take us to the next level,” he said.