The Pareto frontier of foreign languages


I realized recently that I’ve learned one new language per complete decade of my life ($n=2$). I am now over 30% of the way through decade #3, however, and without a significant course correction this statistic will not hold by the time I reach 30. I decided just the other day that I want to course correct, but this leaves an obvious question: what language should I learn?

Criteria

The following analysis may not be right for you. For me, though, in this decade, I roughly care about two questions:

  1. How many hours of study will it take to achieve proficiency? The fewer, the better.
  2. How many people in the world, with whom I can’t easily converse right now, could I converse with after learning this language? The more, the better.1

Ideally, I would like to answer these questions in a general way so that people of all backgrounds can use the result. That is, given any starting set of languages, I’d like to be able to compute an answer for each question. This is surprisingly tough.

Question 1

It would be nice to have a dataset that gives the average number of study hours it takes someone who knows only language X to learn language Y. I have not been able to find this for arbitrary X and Y (if it does exist, please let me know!), but the US Foreign Service Institute (FSI) does have a School of Language Studies with some rough estimates when the starting language is English, which might be helpful.

At a high level, we want to get at how much “work” it will be to learn language Y. I’ve phrased this in terms of “study hours”, but I’d be okay with any metric that’s reasonably well correlated with that. One such metric is lexical similarity, which according to Wikipedia “is a measure of the degree to which the word sets of two given languages are similar”. It is essentially a metric that captures the frequency of cognates between two languages. This does not sound terrible. One concern is that it may fail to account for grammatical difficulties, but lexical similarity is probably a good predictor of the distance between two languages in the great tree of human languages, and that distance is probably a good predictor of grammatical difference. Still, some languages are, as far as I can tell, intrinsically more complicated than others (for example, Hungarian has 18 noun cases!), and I don’t think this metric captures that, but it might be okay.

Helpfully, one of the references on Wikipedia brings us to A Similarity Database of Modern Lexicons, the result of a 2021 study which gives pairwise lexical similarities for 331 languages using what seems like a reasonable method.2 We can do a bit of a sanity check using the FSI estimates. They classify languages into four categories of increasing difficulty for native English speakers. If we plot the average lexical similarity between English and the target language for each category, we get the following:

Average lexical similarity among FSI language categories

It’s decreasing, which is good. There’s a lot more we could dig into here, but for now I’m going to call this a good enough metric to proxy for an answer to question 1: we can take the maximum similarity across all the languages that I know with the target language.
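Concretely, here’s a minimal sketch of that proxy in Python. The data structure is assumed for illustration (a dict of pairwise similarities keyed by unordered language pairs) and the numbers are made up; this is not the actual format of the similarity database:

```python
# Proxy for question 1: max lexical similarity between the target language
# and any language I already know. The `sim` structure below is a stand-in
# for the pairwise similarity data, not the database's real schema.

def similarity_to_known(target: str, known: list[str], sim: dict) -> float:
    """Return the max similarity between `target` and any known language."""
    return max(sim.get(frozenset({target, lang}), 0.0) for lang in known)

# Example (made-up numbers):
sim = {frozenset({"English", "French"}): 0.27,
       frozenset({"Spanish", "French"}): 0.75}
print(similarity_to_known("French", ["English", "Spanish"], sim))  # 0.75
```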

Question 2

The key clause in question 2 is “with whom I can’t easily converse right now.” We can probably get a pretty good answer to the question “how many people speak language Y?”, but in the extreme case it’s possible that almost everyone who speaks language Y also speaks a language that I already know. For example, ~89% of people in Sweden speak English, which makes it unlikely that learning Swedish is the best choice for me given my criteria. What I really want to know is the answer to a question like “how many people speak language Y but not either of languages A or B?”

I’m still not sure how to answer this general question. Someone somewhere might have data on how many people speak, say, every possible combination of three or fewer languages, but in my (short) time searching I did not find this person. We will try our best without such detailed data; suppose we’re thinking about learning language Y. Let’s denote by $Y$ the set of all language Y speakers. Then, if we denote by $A$ the set of language Y speakers who also speak language A, and by $B$ the set of language Y speakers who also speak language B, what we want mathematically is $|Y| - |A\cup B|$, or, in words, the number of language Y speakers minus the number of language Y speakers who also speak either language A or language B. If you draw a nice Venn diagram, you can convince yourself of the inclusion-exclusion principle, which says that $$|A\cup B| = |A| + |B| - |A\cap B|,$$ or, in words, that the size of the union of the two sets is the sum of their individual sizes minus the size of their intersection.

I’m not sure how helpful this is for general languages A and B, but in my particular case, language A is English and language B is Spanish. For a general candidate target language Y, I don’t think it’s a terrible assumption that $|A| \gg |B|$, and also that $|B|$ is close-ish to $|A\cap B|$. That is, in general, people who speak language Y (which is neither English nor Spanish) are much more likely to know English than Spanish, and also conditional on knowing Spanish they are pretty likely to know English. These claims are based on vibes, and I would be happy to hear about counterexamples (maybe Portuguese?), but for now we will assume they mostly hold up. Under these assumptions, $|A\cup B| = |A| + |B| - |A\cap B| \approx |A|$, so to get a reasonable specific-to-me response to question 2 it suffices to estimate $$|Y| - |A|,$$ or, in words, the number of language Y speakers who don’t also speak English.
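As a made-up illustration of why the approximation is reasonable: if $|Y| = 100$M, $|A| = 30$M, $|B| = 5$M, and $|A\cap B| = 4$M, then inclusion-exclusion gives $|A\cup B| = 30 + 5 - 4 = 31$M, so the exact answer is $69$M while the approximation $|Y| - |A|$ gives $70$M – close enough for our purposes (these numbers are invented, not taken from any dataset).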

Ethnologue has a 2022 list of the top 200 languages by number of speakers. It costs $249 to download the data, but the top 20 are available for free, and I am reasonably confident that the ideal language for me to learn this decade is among them.3 It remains to estimate, for each of these languages, the fraction of speakers who also speak English. A decent first pass is to use the percentage of English speakers in the country where the plurality of the language’s speakers reside. I started with this and made some language-dependent adjustments where it made a big difference – for example, it’s pretty important that there are more French speakers in Africa than in France.
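To make the estimate concrete, here’s a rough sketch of the computation, with placeholder speaker counts and English-speaking shares (the values below are invented for illustration, not the Ethnologue or country-level figures):

```python
# Rough sketch of the question-2 estimate: for each candidate language,
# approximate |Y| - |A| as (total speakers) * (1 - share who also speak English).
# All numbers here are placeholders, not the real Ethnologue / country data.

candidates = {
    # language: (total speakers, estimated fraction who also speak English)
    "Language Y": (100_000_000, 0.30),
    "Language Z": (80_000_000, 0.10),
}

def non_english_speakers(total: int, english_share: float) -> int:
    """Approximate number of speakers I couldn't already converse with."""
    return round(total * (1 - english_share))

for lang, (total, share) in candidates.items():
    print(f"{lang}: ~{non_english_speakers(total, share):,}")
```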

Choosing a language

Great, so now we (kind of) have answers to both questions! In particular, we will prioritize languages that 1) have higher max similarity to English or Spanish and 2) have more speakers who don’t also speak English. But it’s unlikely that one language maximizes both of these features, and we haven’t yet decided how we feel about trade-offs between 1) and 2). The lexical similarity scores are also a bit nebulous, and I don’t really know how they translate to difficulty of learning. However, even before dealing with either of these problems, we can rule out most of the languages on our list using the following idea: if some candidate language L has both greater similarity to the languages I already know and more non-English speakers than some other candidate language M, I would always prefer to learn L over M. The mathematical terminology for this is that language L dominates language M, since it wins in every category.

Let’s plot all of the candidate languages, where one axis is their similarity to my current repertoire and the other is the number of speakers who don’t speak English.

Languages that are not dominated by any others are colored purple, so that we end up with just five candidate languages: Mandarin Chinese, Hindi, Russian, French, and Portuguese. These languages lie on what is known as the Pareto frontier of our space of languages and features! There is a formal mathematical definition at the Wikipedia link, but the gist is that the Pareto efficient solutions to our language choice problem are those that are not dominated by any other candidate language.
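For what it’s worth, the dominance filter itself is only a few lines. Here’s a minimal sketch (the `scores` structure is assumed for illustration and isn’t necessarily what’s in the repo; higher is better on both axes):

```python
# Keep only Pareto-efficient languages: those not dominated by any other.
# `scores` maps language -> (similarity to known languages, non-English speakers),
# with higher values better on each axis (illustrative structure).

def dominates(a: tuple, b: tuple) -> bool:
    """True if `a` is at least as good everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_frontier(scores: dict) -> list[str]:
    return [
        lang for lang, score in scores.items()
        if not any(dominates(other, score)
                   for other_lang, other in scores.items() if other_lang != lang)
    ]
```

Notice that nothing here is specific to two dimensions, which will matter again at the end of the post.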

Now, of course, I still have to decide how to make trade-offs between these, but the great thing is that I only need to think about these five. Even though the idea of removing all dominated solutions is simple, it’s actually really powerful: we never had to decide how we feel about trade-offs between the two axes we were considering.

Pareto efficiency appears everywhere. For example, assuming markets work, which, depending on who you ask, might be a big assumption, almost all of our daily consumption decisions should involve a choice among Pareto efficient options. That’s because if some choice of product dominated another (in other words, if it were cheaper and better in every relevant way) it would, in theory, drive all of the products it dominates out of business! Stated another way, to the extent that you believe that the efficient-market hypothesis applies in some context, you should probably be skeptical any time it seems like one choice in that context is better on all axes than another; what axes aren’t you considering?

Improvements

There are quite a few improvements to be made here. For question 1, as mentioned earlier, I don’t think lexical similarity captures close to all that we want when it comes to “difficulty”, so it would be interesting to think about how to improve that metric with others. Also, even if we do only use lexical similarity, it’s probably not correct to simply take the maximum similarity across all of my known languages; if for example I knew all Romance languages except for one, it’s almost certainly easier for me to pick up that last one than for someone who only speaks the Romance language closest to it.

I really wanted to get a more general solution to question 2, and would be happy to hear about any ideas for how to do so. At the same time, I think for my current set of languages, the non-English speaking population estimate is not so bad. It would also be interesting to revise question 2 a bit and think about how its answer will change over time. As it turns out, a not insignificant factor in my final decision was Africa’s total fertility rate, and, in the end, I decided to go with French! 🇫🇷

Lastly, in truth I probably care about more than just these two questions. Thankfully, the Pareto frontier will exist no matter how many dimensions we decide to care about, although as the number of dimensions increases, the size of the frontier should also increase. The intuitive reason for this is that as we start to care about more axes, it becomes “harder” for one language to beat another on every single axis, and so fewer languages end up being dominated.
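One way to get a feel for that intuition is a quick simulation: draw random points in $d$ dimensions and count how many are non-dominated, reusing the `dominates` check from the sketch above (a toy experiment on random data, not the language dataset):

```python
# Toy experiment: the Pareto frontier of uniformly random points grows
# quickly with the number of dimensions. Not the language data.
import random

def frontier_size(n_points: int, dims: int) -> int:
    pts = [tuple(random.random() for _ in range(dims)) for _ in range(n_points)]
    return sum(1 for p in pts if not any(dominates(q, p) for q in pts if q != p))

for d in (2, 3, 5, 8):
    print(d, frontier_size(200, d))
```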

If you want to check out the code, I’ve uploaded it to GitHub here. And if you have any feedback or comments, reach out on Twitter or over email!


  1. we can stick with two dimensions for now, but to be clear I also kind of care about 3. How fun is the music? 🎶

  2. Gábor Bella, Khuyagbaatar Batsuren, and Fausto Giunchiglia. A Database and Visualization of the Similarity of Contemporary Lexicons. 24th International Conference on Text, Speech, and Dialogue. Olomouc, Czech Republic, 2021.

  3. we’ll actually end up with 16 candidate languages, since I know two of them and some lexical similarities were missing from the database in footnote 2 for Yue Chinese and Nigerian Pidgin.