How to detect (western) language with Python


Various options for (western) language detection

In order to optimise an NLP preprocessing pipeline, or to be able to tag a batch of documents and to present a user only with results in their preferred language, it might be useful to automatically determine the language of a text sample.

This article presents various ways of doing so in Python, from custom solutions to external libraries. Each solution is evaluated along three dimensions: accuracy of language detection, execution time, and ease of use.

Experimental setup

We use the genesis corpus from nltk, which has the advantage of being easily available. You can download it after installing nltk, as shown below:

In [1]:
import nltk
nltk.download('genesis')
[nltk_data] Downloading package genesis to /home/sdg/nltk_data...
[nltk_data]   Package genesis is already up-to-date!
Out[1]:
True

The genesis corpus contains the text from the book of Genesis in 6 languages: Finnish, French, German, Portuguese, Swedish, and English (the latter in three different versions).

The writing style might not be representative of the typical context in which language detection is used (it is very formal and rather dated), but it has the advantage of already being labeled.

In all of the following solutions, the genesis corpus will be used solely for testing. When we train our classifier for custom solutions, we will use other data sources.

We will compute accuracy when predicting each sentence of the corpus, and the execution time for predicting the complete dataset.

External dependencies, in addition to nltk, are numpy and pandas.

Dataset creation

We create a Pandas dataframe containing all sentences with their associated labels.

In [2]:
import pandas as pd
import numpy as np
from nltk.corpus import genesis as dataset

dfs  = []
for ids in dataset.fileids():
    df = pd.DataFrame(data=np.array(dataset.sents(ids)), columns=['sentences'])
    # The three English versions (KJV, Web, lolcat) all get the label 'english';
    # other files are labeled with the file name minus the '.txt' extension.
    df['label'] = 'english' if ids in {'english-kjv.txt', 'english-web.txt', 'lolcat.txt'} else ids[:-len('.txt')]
    dfs.append(df)
sentences = pd.concat(dfs)

Naive solution (baseline)

We present here a naive solution relying on stop words (most common words in a language). We will use the stop words corpus from nltk.

We first create a dictionary mapping each stop word to the languages it belongs to. It must be noted that this dictionary includes languages which are not present in the genesis corpus, such as Norwegian or Danish. This ensures a fair comparison between custom solutions and external libraries (which have no restriction on which languages might be present).

In [3]:
from nltk.corpus import stopwords
from collections import defaultdict

languages = stopwords.fileids()
stopwords_dict = defaultdict(list)
for l in languages:
    for sw in stopwords.words(l):
        stopwords_dict[sw].append(l)

For each sentence (represented as a list of tokens), we count, for each language, how many of its stop words appear in the sentence, using a Counter to accumulate the counts. We then simply predict the language with the largest count (if the counter is not empty; otherwise we predict 'unknown').

In case of a tie, we choose at random among the tied languages.

In [4]:
from collections import defaultdict, Counter
import random

def predict_language_naive(sentence):
    random.seed(0)
    cnt = Counter()
    cnt.update(language
              for word in sentence
              for language in stopwords_dict.get(word, ()))
    if not cnt:
        return 'unknown'
        
    m = max(cnt.values())
    return random.choice([k for k, v in cnt.items() if v == m])

We can compute the accuracy as follows:

In [5]:
def compute_accuracy(predictor):
    return (sentences['sentences'].apply(predictor) == sentences['label']).sum() / len(sentences)
In [6]:
compute_accuracy(predict_language_naive)
Out[6]:
0.92565982404692082

As a side note, accuracy might not be the ideal metric here, since we have a slightly unbalanced class distribution, with English being roughly three times as frequent as any other language.
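As a quick sanity check (a minimal sketch; output not shown), the class distribution can be inspected directly from the dataframe:

# Relative frequency of each label in the test set; English should come out
# roughly three times as frequent as the other languages, since it appears in three versions.
sentences['label'].value_counts(normalize=True)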

Execution time is computed using the timeit magic.

In [7]:
%timeit compute_accuracy(predict_language_naive)
299 ms ± 24.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

This solution is quite fast, but not very accurate. It does not use any external library, which might be an advantage in some contexts.

External libraries

Now that we have a baseline, we can benchmark a few external libraries to see how well they perform. We know they will probably be more accurate, but at what cost in terms of execution time?

We tested 2 libraries, langdetect and pycld2.

langdetect

langdetect is a Python port of Google's language-detection library (see its official documentation for more details). Unfortunately, the ported code is not very Pythonic...

However, it's easily installed with pip.

In [8]:
from langdetect import detect, lang_detect_exception

The langdetect API takes whole strings (not token lists) as input, so our wrapper first joins the tokens back into a single string.

The detect function may also raise an exception when it is unsure about the language, in which case we want to return an 'unknown' label; our wrapper should catch those exceptions.

Finally, the output is an ISO 639-1 code, which is not very user-friendly, so we use a mapping dictionary to convert it to a human-readable language name.

In [9]:
iso_to_human = {'aa': 'afar', 'ab': 'abkhazian', 'af': 'afrikaans', 'ak': 'akan', 'am': 'amharic', 'an': 'aragonese', 'ar': 'arabic', 'as': 'assamese', 'av': 'avar', 'ay': 'aymara', 'az': 'azerbaijani', 'ba': 'bashkir', 'be': 'belarusian', 'bg': 'bulgarian', 'bh': 'bihari', 'bi': 'bislama', 'bm': 'bambara', 'bn': 'bengali', 'bo': 'tibetan', 'br': 'breton', 'bs': 'bosnian', 'ca': 'catalan', 'ce': 'chechen', 'ch': 'chamorro', 'co': 'corsican', 'cr': 'cree', 'cs': 'czech', 'cu': 'old bulgarian', 'cv': 'chuvash', 'cy': 'welsh', 'da': 'danish', 'de': 'german', 'dv': 'divehi', 'dz': 'dzongkha', 'ee': 'ewe', 'el': 'greek', 'en': 'english', 'eo': 'esperanto', 'es': 'spanish', 'et': 'estonian', 'eu': 'basque', 'fa': 'persian', 'ff': 'peul', 'fi': 'finnish', 'fj': 'fijian', 'fo': 'faroese', 'fr': 'french', 'fy': 'west frisian', 'ga': 'irish', 'gd': 'scottish gaelic', 'gl': 'galician', 'gn': 'guarani', 'gu': 'gujarati', 'gv': 'manx', 'ha': 'hausa', 'he': 'hebrew', 'hi': 'hindi', 'ho': 'hiri motu', 'hr': 'croatian', 'ht': 'haitian', 'hu': 'hungarian', 'hy': 'armenian', 'hz': 'herero', 'ia': 'interlingua', 'id': 'indonesian', 'ie': 'interlingue', 'ig': 'igbo', 'ii': 'sichuan yi', 'ik': 'inupiak', 'io': 'ido', 'is': 'icelandic', 'it': 'italian', 'iu': 'inuktitut', 'ja': 'japanese', 'jv': 'javanese', 'kg': 'kongo', 'ki': 'kikuyu', 'kj': 'kuanyama', 'kk': 'kazakh', 'kl': 'greenlandic', 'km': 'cambodian', 'kn': 'kannada', 'ko': 'korean', 'kr': 'kanuri', 'ks': 'kashmiri', 'ku': 'kurdish', 'kv': 'komi', 'kw': 'cornish', 'ky': 'kirghiz', 'la': 'latin', 'lb': 'luxembourgish', 'lg': 'ganda', 'li': 'limburgian', 'ln': 'lingala', 'lo': 'laotian', 'lt': 'lithuanian', 'lv': 'latvian', 'mg': 'malagasy', 'mh': 'marshallese', 'mi': 'maori', 'mk': 'macedonian', 'ml': 'malayalam', 'mn': 'mongolian', 'mo': 'moldovan', 'mr': 'marathi', 'ms': 'malay', 'mt': 'maltese', 'my': 'burmese', 'na': 'nauruan', 'nd': 'north ndebele', 'ne': 'nepali', 'ng': 'ndonga', 'nl': 'dutch', 'nn': 'norwegian nynorsk', 'no': 'norwegian', 'nr': 'south ndebele', 'nv': 'navajo', 'ny': 'chichewa', 'oc': 'occitan', 'oj': 'ojibwa', 'om': 'oromo', 'or': 'oriya', 'os': 'ossetian', 'pa': 'punjabi', 'pi': 'pali', 'pl': 'polish', 'ps': 'pashto', 'pt': 'portuguese', 'qu': 'quechua', 'rm': 'raeto romance', 'rn': 'kirundi', 'ro': 'romanian', 'ru': 'russian', 'rw': 'rwandi', 'sa': 'sanskrit', 'sc': 'sardinian', 'sd': 'sindhi', 'sg': 'sango', 'sh': 'serbo-croatian', 'si': 'sinhalese', 'sk': 'slovak', 'sl': 'slovenian', 'sm': 'samoan', 'sn': 'shona', 'so': 'somalia', 'sq': 'albanian', 'sr': 'serbian', 'ss': 'swati', 'st': 'southern sotho', 'su': 'sundanese', 'sv': 'swedish', 'sw': 'swahili', 'ta': 'tamil', 'te': 'telugu', 'tg': 'tajik', 'th': 'thai', 'ti': 'tigrinya', 'tk': 'turkmen', 'tl': 'tagalog', 'tn': 'tswana', 'to': 'tonga', 'tr': 'turkish', 'ts': 'tsonga', 'tt': 'tatar', 'tw': 'twi', 'ty': 'tahitian', 'ug': 'uyghur', 'ur': 'urdu', 've': 'venda', 'vi': 'vietnamese', 'vo': 'volapük', 'wa': 'walloon', 'wo': 'wolof', 'xh': 'xhosa', 'yi': 'yiddish', 'yo': 'yoruba', 'za': 'zhuang', 'zh': 'chinese', 'zu': 'zulu'}


def detect_without_exception(s):
    try:
        return iso_to_human[detect(' '.join(s))]
    except lang_detect_exception.LangDetectException:
        return 'unknown'

We had the following results in terms of prediction accuracy and execution time:

In [10]:
compute_accuracy(detect_without_exception)
Out[10]:
0.96539589442815255
In [11]:
%timeit compute_accuracy(detect_without_exception)
51.3 s ± 966 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

We saw improvements in classification accuracy, at the expense of being over 150 times slower: unacceptable in most use cases.

pycld2

pycld2 provides Python bindings around the Google compact language detection library (CLD2).

The API exposes more details than langdetect, providing a confidence percentage for each detected language (a sketch of the raw output is shown a bit further down), and since it is a wrapper around compiled C++ code, we can hope that it will be faster.

It's easily installed with pip.

It is the underlying library used by Polyglot, an NLP library offering a wide variety of tools for multilingual applications. You should definitely check it out!
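To give an idea of what the raw API returns, here is a minimal sketch (the exact percentages and scores will differ):

import pycld2 as cld2

# detect() returns (isReliable, textBytesFound, details); details is a tuple of
# (language name, ISO code, percent of text, score) entries, best match first.
is_reliable, bytes_found, details = cld2.detect("Au commencement, Dieu créa les cieux et la terre.")
print(is_reliable, details[0])  # e.g. True ('FRENCH', 'fr', ..., ...)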

Just like langdetect, pycld2 takes whole strings as input, so we again join the tokens with spaces before passing them to the detector.

In [12]:
import pycld2 as cld2

compute_accuracy(lambda s: cld2.detect(' '.join(s), bestEffort=True)[2][0][0].lower())
Out[12]:
0.97375366568914956
In [13]:
%timeit compute_accuracy(lambda s: cld2.detect(' '.join(s), bestEffort=True)[2][0][0].lower())
134 ms ± 776 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

The accuracy is actually slightly better than what we had with langdetect, and it's even faster than our naive solution.

The downside is that the GitHub repository has not been updated since 2015, and the documentation seems out of sync. Furthermore, the computation is not done in Python, which makes it harder to alter the code to suit custom needs.

One last thing we can try is to bias the algorithm towards choosing English more often, given that it is the most frequent language in our dataset.

In [14]:
compute_accuracy(lambda s: cld2.detect(' '.join(s), bestEffort=True, hintLanguage='ENGLISH')[2][0][0].lower())
Out[14]:
0.96796187683284463

In this example, there was no improvement in accuracy, most likely because the pieces of text we label are so short. However, such a bias might be of use in other contexts.

Improvements on the naive solution

Can we beat the 97% accuracy of an off-the-shelf solution? Let's try improving our naive solution.

Training dataset

In order to improve our naive solution, we will need another source of multilingual text—using the genesis corpus would be cheating since it's our test set.

We've used the European Parliament Proceedings Parallel Corpus instead. You can download it with nltk.
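If the corpus is not already present locally, it can be fetched the same way as genesis (assuming the standard nltk downloader):

import nltk
nltk.download('europarl_raw')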

In [15]:
from nltk.corpus import europarl_raw

We obtain the list of words for each language, as follows:

In [16]:
europarl_raw.english.words()
Out[16]:
['Resumption', 'of', 'the', 'session', 'I', 'declare', ...]

We define the list of languages for which we have data:

In [17]:
languages = ['danish', 'dutch', 'english', 'finnish', 'french', 'german', 'italian', 'portuguese', 'spanish', 'swedish']

We also define a small function to help us clean our lists of tokens.

In [18]:
def clean_tokens(tokens):
    return [token.lower() for token in tokens if token.isalpha()]

Weight stop words

We know that some stop words are present in more than one language. We can consider these words less discriminant with respect to the languages they belong to, so we assign each stop word a weight inversely proportional to the number of languages it appears in.

In [19]:
weighted_stopwords_dict = defaultdict(dict)
for sword, langs in stopwords_dict.items():
    coeff = 1/ len(langs)
    for lang in langs:
        weighted_stopwords_dict[sword][lang] = coeff
In [20]:
def predict_language_weighted_stopwords(sentence):
    random.seed(0)
    cnt = Counter()
    for word in sentence:
        if word in weighted_stopwords_dict:
            cnt.update(weighted_stopwords_dict[word])

    if not cnt:
        return 'unknown'
    m = max(cnt.values())
    return random.choice([k for k, v in cnt.items() if v == m])
In [21]:
compute_accuracy(predict_language_weighted_stopwords)
Out[21]:
0.92184750733137832
In [22]:
%timeit compute_accuracy(predict_language_weighted_stopwords)
413 ms ± 47.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Unfortunately, this weighting scheme does not improve our naive solution.

Using diacritics

Diacritics are, as defined by Wikipedia, glyphs added to a letter. When present, they can be quite distinctive of a given language, and can therefore be used in addition to stop words to improve classification accuracy for western languages.

First, we need to determine a list of diacritics used per language. We will use the European Parliament Proceedings to do so.

In the first line of the function, we get a list of all characters present in the proceedings for a given language, after cleaning the tokens (we keep only alphabetic words and we set everything to lower case).

Then we count the number of occurrences of each character. We remove characters occurring 500 times or fewer, since they may come from foreign words (such as surnames or place names) and we only want to keep the native diacritics of each specific language.

In the last step, we remove non-accented characters (i.e. plain ASCII letters) from the set.

In [23]:
import string

def get_diacritics(language):
    # All characters appearing in the cleaned tokens of the corpus for that language.
    char_list = list(''.join(clean_tokens(getattr(europarl_raw, language).words())))
    cnt = Counter(char_list)
    # Keep only characters seen more than 500 times, to filter out foreign words.
    frequent_chars = {k for k, v in cnt.items() if v > 500}
    return frequent_chars - set(string.ascii_lowercase)

Let's print the list of diacritics per language.

In [24]:
diacritics = {language: list(get_diacritics(language)) for language in languages}
diacritics
Out[24]:
{'danish': ['æ', 'å', 'ø', 'é'],
 'dutch': ['ë', 'é'],
 'english': [],
 'finnish': ['ö', 'ä'],
 'french': ['à', 'û', 'ô', 'ê', 'è', 'ç', 'é', 'î'],
 'german': ['ö', 'ü', 'ä', 'ß'],
 'italian': ['à', 'ò', 'ù', 'è', 'ì', 'é'],
 'portuguese': ['à', 'ú', 'ê', 'ã', 'ç', 'á', 'é', 'í', 'ó', 'õ', 'â'],
 'spanish': ['ú', 'ñ', 'á', 'é', 'í', 'ó'],
 'swedish': ['ö', 'å', 'ä']}

The lists seem about right (at least for the languages I know), and building them from the corpus is reasonably fast.

Now that we have a list of diacritics, we can use the same method we used to detect languages using stop words.

At first, let's try using only diacritics.

In [25]:
diacritics_transposed = defaultdict(list)
for language, chars in diacritics.items():
    for char in chars:
        diacritics_transposed[char].append(language)

        
def predict_language_diacritics(sentence):
    cnt = Counter()
    # Count, for each language, how many of its diacritics appear in the sentence.
    cnt.update(language
               for ch in ''.join(sentence).lower()
               for language in diacritics_transposed.get(ch, ())
               if ch not in string.ascii_lowercase)
    if not cnt:
        return 'english'
    m = max(cnt.values())
    return random.choice([k for k, v in cnt.items() if v == m])
In [26]:
compute_accuracy(predict_language_diacritics)
Out[26]:
0.65058651026392966
In [27]:
%timeit compute_accuracy(predict_language_diacritics)
169 ms ± 5.79 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

On chunks of text so small, we are far from guaranteed to have diacritics, which could explain the low accuracy.

To verify if our hypothesis is right, we can check the confusion matrix.

We use the pandas-ml library, which combines the power of scikit-learn with the readability of pandas.

In [28]:
from pandas_ml import ConfusionMatrix
ConfusionMatrix(sentences['label'], sentences['sentences'].apply(predict_language_diacritics))
Out[28]:
Predicted   danish  dutch  english  finnish  french  german  italian  portuguese  spanish  swedish  __all__
Actual
danish           0      0        0        0       0       0        0           0        0        0        0
dutch            0      0        0        0       0       0        0           0        0        0        0
english          0      0     4521        0       0       0        0           0        0        0     4521
finnish          0      0      227      648       0     671        0           0        0      614     2160
french          66    115      295        0     646       0      462         339       81        0     2004
german           0     10      876      152       0     687        0           0        0      175     1900
italian          0      0        0        0       0       0        0           0        0        0        0
portuguese      12     10      198        0      77       1       18        1213      140        0     1669
spanish          0      0        0        0       0       0        0           0        0        0        0
swedish         43      1       35       95       1      89        1           0        2     1119     1386
__all__        121    136     6152      895     724    1448      481        1552      223     1908    13640
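As a side note, pandas-ml is no longer actively maintained and may not work with recent pandas releases; in that case, a plain pandas crosstab (a minimal equivalent sketch) produces the same matrix:

# Confusion matrix built with pandas only, no extra dependency.
pd.crosstab(sentences['label'],
            sentences['sentences'].apply(predict_language_diacritics),
            rownames=['Actual'], colnames=['Predicted'])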

The confusion matrix gives us two very interesting pieces of information.

First, a lot of sentences are predicted as English; in fact, any sentence without diacritics will be predicted as English, since the English language uses none. For short sentences, it is quite possible that no diacritics appear at all, whatever the language.

Second, we can observe that, for example, a large number of Swedish sentences are predicted as Finnish. That can be explained by the fact that two out of three Swedish diacritics are also Finnish ones, and our naive implementation returns a language at random amongst the most probable in case of equality.

Now let's try to use diacritics in addition to stop words.

In [29]:
def predict_language_stopwords_diacritics(sentence):
    random.seed(0)
    cnt = Counter()
    cnt.update(language
              for word in sentence
              for language in stopwords_dict.get(word, ()))
    cnt.update(language
               for ch in ''.join(sentence).lower()
               for language in diacritics_transposed[ch]
               if ch not in string.ascii_lowercase)
    if not cnt:
        return 'unknown'
        
    m = max(cnt.values())
    return random.choice([k for k, v in cnt.items() if v == m])
In [30]:
compute_accuracy(predict_language_stopwords_diacritics)
Out[30]:
0.93995601173020527
In [31]:
%timeit compute_accuracy(predict_language_stopwords_diacritics)
463 ms ± 63.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

With this solution we gain in accuracy at the expense of a slightly increased running time.

Train a classifier based on n-gram embeddings

We are going to try something a little more sophisticated, using Facebook's fastText library for text classification. In order to do that, we need a dataset on which to train our classifier. We will use the European Parliament Proceedings corpus.

More information about fastText can be found in the documentation.

In [32]:
from pyfasttext import FastText
from sklearn.model_selection import train_test_split
from nltk import ngrams

The fastText library is trained on n-grams (tuples of n words), using a linear classifier on top of a hidden word embedding. Let's create a set of trigrams to learn on.
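As a quick illustration of what nltk.ngrams produces (a toy example, not taken from the corpus):

from nltk import ngrams

# Each trigram is a tuple of three consecutive tokens.
list(ngrams(['in', 'the', 'beginning', 'god'], 3))
# -> [('in', 'the', 'beginning'), ('the', 'beginning', 'god')]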

In [33]:
doc_set = [(language, clean_tokens(getattr(europarl_raw, language).words())) for language in languages]

trigrams_set = [(language, ' '.join(trigram)) for (language, words) in doc_set
                                    for trigram in ngrams(words, 3)]
In [34]:
train_set, test_set = train_test_split(trigrams_set, test_size = 0.30, random_state=0)

pyfasttext is a wrapper around the fastText command-line tool, so we need to dump the training set to a file before training the classifier.

In [35]:
with open('train_data_europarl.txt', 'w') as f:
    for label, words in train_set:
        f.write('__label__{} {}\n'.format(label, words))
In [36]:
model = FastText()
model.supervised(input='train_data_europarl.txt', output='model_europarl', epoch=10, lr=0.7, wordNgrams=3)

We can then evaluate the accuracy on the training set and on the held-out test set.

In [37]:
# train accuracy
labels, samples = np.split(np.array(train_set), 2, axis=1)
(np.array(model.predict(samples.T[0])) == labels).sum() / len(train_set)
Out[37]:
0.99680029382291524
In [38]:
# test accuracy
labels, samples = np.split(np.array(test_set), 2, axis=1)
(np.array(model.predict(samples.T[0])) == labels).sum() / len(test_set)
Out[38]:
0.98648199595051833

We can now apply this model to our initial dataset.

In [39]:
(model.predict(sentences['sentences'].str.join(' ') + '\n') == sentences['label'][:, None]).sum()/len(sentences)
Out[39]:
0.97514662756598236
In [40]:
%timeit model.predict(sentences['sentences'].str.join(' ') + '\n')
204 ms ± 22.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Conclusion

We have summed up our findings in the following table.

Algorithm                 Accuracy   Execution time   Comments
Stopwords based           92.5%      299 ms           Baseline
Weighted stopwords        92.2%      413 ms
Diacritics                65.0%      169 ms
Diacritics + stopwords    94.0%      463 ms
langdetect                96.5%      51 300 ms        Too slow to be of any use
pycld2                    97.3%      134 ms           External library; handles a large number of languages
fastText                  97.5%      204 ms           Needs a training corpus; can be trained on specialized data

Based on these results, the only two relevant options are pycld2, which handles a large number of languages out of the box and does not require any labeled data, and fastText, which might be a worthy alternative if one has specialized data on which to train it.

In the interest of fairness, we should note that external libraries can also handle non-European languages, which use non-Latin scripts and for which the notion of "words" must be redefined. Our custom solutions do not have the same ambition, and in addition require a labeled corpus to be trained on.

Another important thing to note is that accuracy does not tell the whole story, so using a confusion matrix to identify the kinds of mistakes a classifier makes is paramount. For the sake of brevity, confusion matrices have not been included for every solution.