Google is getting multilingual, but is it getting the nuance?
LIMA, Peru — About 10 million people speak Quechua. Still, it has been nearly impossible to automatically translate emails and text messages into the most widely spoken native language family in the Americas.
That changed on Wednesday when Google added Quechua and several other languages to its digital translation service.
The internet giant says new artificial intelligence technology makes it possible to vastly expand Google Translate’s repertoire of the world’s languages. It added 24 this week, including Quechua and other native South American languages such as Guarani and Aymara. It also added some commonly spoken African and South Asian languages that were missing from popular tech products.
“We looked at languages with very large, underserved populations,” Google researcher Isaac Caswell told reporters.
The news from the California company’s annual I/O technology show may be celebrated in many corners of the world. But it is also likely to draw criticism from those frustrated by past tech products that failed to grasp the nuances of their language or culture.
Quechua was the lingua franca of the Inca Empire, which stretched from southern Colombia to central Chile. Its status began to decline after the Spanish conquest of Peru more than 400 years ago.
Adding it to the languages recognized by Google is a big win for Quechua language activists like Luis Illaccanqui, a Peruvian who created the website Qichwa 2.0, which includes dictionaries and resources for learning the language.
“It will help give Quechua and Spanish equal status,” said Illaccanqui, who was not involved in Google’s project.
Illaccanqui, whose surname in Quechua means “you are the lightning bolt,” said the translator will also help keep the language alive with a new generation of young people and teenagers, “who speak Quechua and Spanish at the same time and are fascinated by social networks.”
Caswell called the news a “very big technological step forward” because until recently, it was impossible to add languages if researchers couldn’t find enough online text – such as digital books, newspapers, or social media posts – for their AI systems to learn from.
US tech giants don’t have a great track record of making their language technology work well outside the wealthiest markets. This problem has also made it harder for them to detect dangerous misinformation on their platforms. Until this week, Google Translate was offered in European languages such as Frisian, Maltese, Icelandic, and Corsican – each with fewer than 1 million speakers – but not in East African languages such as Oromo and Tigrinya, which have millions of speakers.
The new languages will be rolled out this week. They are not yet understood by Google’s voice assistant, so they are limited to text-to-text translations for now. Google said it is working on adding speech recognition and other capabilities, such as translating a sign by pointing a camera at it.
That will be important for languages like Quechua that are mainly spoken rather than written, especially in the health field, because many Peruvian doctors and nurses who speak only Spanish work in rural areas and “are unable to understand patients who mainly speak Quechua,” Illaccanqui said.
“The next frontier, or challenge, is to work on speech,” said Arturo Oncevay, a Peruvian machine translation researcher at the University of Edinburgh who co-founded a research coalition to improve indigenous language technology in the Americas. “America’s native languages are traditionally oral.”
In its announcement, Google warned that the quality of translations in the newly added languages “still lags far behind” other languages it supports, such as English, Spanish, and German, noting that the models “will make mistakes and develop their own biases.” But the company only added languages if its AI systems met a certain skill threshold, Caswell said.
“If there’s a significant number of cases where it’s very wrong, then we wouldn’t include it,” he said. “Even if 90% of the translations are perfect, but 10% are bullshit, that’s too much for us.”
Google said its products now support 133 languages. The latest 24 are the largest batch added since Google incorporated 16 new languages in 2010. What made the expansion possible is what Google calls a “zero-shot” or “zero-resource” machine translation model: one that learns to translate into another language without ever seeing an example.
Caswell said that Google’s model works by training a “single giant neural AI model” on about 100 data-rich languages and then applying what it learned to hundreds of other languages it doesn’t know. “Imagine you’re some big polyglot, and you start reading novels in another language, then you can start thinking about what it might mean based on your knowledge of the language in general,” he said.
He said the new group ranges from smaller languages like Mizo, spoken in northeast India by about 800,000 people, to more commonly spoken languages like Lingala, spoken by about 45 million people in Central Africa.
More than 15 years ago – in 2006 – Microsoft received some positive attention in South America with a software feature that translated familiar Microsoft menus and commands into Quechua. But that was before the current wave of AI improvements in real-time translation.
Harvard University linguist Américo Mendoza-Mori, who speaks Quechua, said getting Google’s attention makes the language more visible in places like Peru, where Quechua speakers are still missing from many public services. The survival of many of these languages “depends on their use in digital contexts,” he said.
Another linguist, Roberto Zariquiey, said he is skeptical about whether Google could make an effective language revitalization tool for Quechua, Aymara, or Guarani without closer participation from community groups in the region.
“Languages are closely linked to lives, cultures, ethnic groups, and political organizations,” said Zariquiey, a linguist at the Pontifical Catholic University of Peru. “This has to be taken into account.”
The newly added languages are Assamese, Aymara, Bambara, Bhojpuri, Dhivehi, Dogri, Ewe, Guarani, Ilocano, Konkani, Krio, Lingala, Luganda, Maithili, Meiteilon (Manipuri), Mizo, Oromo, Quechua, Sanskrit, Sepedi, Sorani Kurdish, Tigrinya, Tsonga, and Twi.
O’Brien reported from Providence, Rhode Island.