
Florian Schottmann

Research

October 3, 2024

Turning English-centric LLMs into polyglots: how much multilingualism is needed?


Disclaimer: This article was written in 2024 and describes the situation before Textshuttle’s merger with Supertext and the subsequent relaunch at supertext.com.




The vast majority of today’s large language models (LLMs) are English-centric, having been pretrained predominantly on English text. Yet in order to meet user expectations, models need to be able to respond appropriately in multiple languages once deployed in downstream applications. This requires strong cross-lingual transfer abilities. In this work, we investigate the minimum amount of multilingualism required during finetuning to elicit cross-lingual generalisation in English-centric LLMs. In experiments across four LLMs, we find that multilingual instruction tuning with as few as two to three languages is both necessary and sufficient to elicit effective cross-lingual generalisation, with the limiting factor being the degree to which a target language is seen during pretraining. Evaluations across five different tasks further reveal that multilingual instruction tuning is most beneficial for generative tasks that assume input/output language agreement, such as in chat settings, while being of less importance for highly structured classification-style tasks. Our code and data are available on GitHub.
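To make the idea of multilingual instruction tuning more concrete, here is a minimal Python sketch of how an instruction-tuning mix covering two to three languages might be assembled before finetuning an English-centric base model. The example records, language choices, and uniform mixing ratio are illustrative assumptions for this post, not the data or code used in the paper.

```python
import json
import random

# Illustrative instruction-tuning examples per language; in practice these
# would be loaded from a full instruction dataset rather than defined inline.
EXAMPLES = {
    "en": [{"instruction": "Summarise the text.", "response": "..."}],
    "de": [{"instruction": "Fasse den Text zusammen.", "response": "..."}],
    "fr": [{"instruction": "Résume le texte.", "response": "..."}],
}


def build_multilingual_mix(examples_by_lang, languages, n_total, seed=0):
    """Sample an instruction-tuning mix drawn uniformly from the given languages."""
    rng = random.Random(seed)
    per_lang = n_total // len(languages)
    mix = []
    for lang in languages:
        pool = examples_by_lang[lang]
        # Sample with replacement so small pools can still fill their share.
        mix.extend(rng.choices(pool, k=per_lang))
    rng.shuffle(mix)
    return mix


if __name__ == "__main__":
    # Two to three instruction-tuning languages, in line with the paper's finding
    # that this small amount of multilingualism is already sufficient.
    mix = build_multilingual_mix(EXAMPLES, ["en", "de", "fr"], n_total=9)
    for example in mix:
        # Each record would then be formatted with the model's chat template
        # and used for supervised finetuning of the English-centric base model.
        print(json.dumps(example, ensure_ascii=False))
```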


Read the entire research paper on arXiv.
