# Various options for (Western) language detection

In order to optimise an NLP preprocessing pipeline, or to tag a batch of documents and present a user only with results in their preferred language, it can be useful to automatically determine the language of a text sample.

This article presents various options to do so in Python, from custom solutions to external libraries. Each solution is evaluated along three dimensions: accuracy in language detection, execution time and ease of use.

## Experimental setup

We use the genesis corpus from nltk, which has the advantage of being easily available. You can download it as follows after installing nltk:

In [1]:
import nltk
nltk.download('genesis')

[nltk_data] Downloading package genesis to /home/sdg/nltk_data...
[nltk_data]   Package genesis is already up-to-date!

Out[1]:
True

The genesis corpus contains the text of the Book of Genesis in six languages: Finnish, French, German, Portuguese, Swedish, and three different English versions.

The writing style might not be representative of the typical context in which language detection could be used (very formal and rather outdated), but it has the advantage of being already labeled.

In all of the following, the genesis corpus will be used solely for testing. When we train our classifier for custom solutions, we will use other data sources.

We will compute accuracy when predicting each sentence of the corpus, and the execution time for predicting the complete dataset.

External dependencies in addition to nltk are numpy and pandas.

### Dataset creation

We create a Pandas dataframe containing all sentences with their associated labels.

In [2]:
import pandas as pd
import numpy as np
from nltk.corpus import genesis as dataset

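dfs = []
for ids in dataset.fileids():
    # One row per sentence; each sentence is a list of tokens.
    df = pd.DataFrame({'sentences': [list(sent) for sent in dataset.sents(ids)]})
    # The three English-language files share the 'english' label; otherwise
    # the label is the file name without its '.txt' extension.
    is_english = ids in {'english-kjv.txt', 'english-web.txt', 'lolcat.txt'}
    df['label'] = 'english' if is_english else ids[:-len('.txt')]
    dfs.append(df)
sentences = pd.concat(dfs)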


## Naive solution (baseline)

We present here a naive solution relying on stop words (most common words in a language). We will use the stopwords corpus from nltk.

We first create a dictionary of stop words per language. Note that this dictionary includes languages which are not present in the genesis corpus, such as Norwegian or Danish. This ensures a fair comparison between custom solutions and external libraries (which have no restriction on which languages might be present).

In [3]:
from nltk.corpus import stopwords
from collections import defaultdict

languages = stopwords.fileids()
# Map each stop word to the list of languages it belongs to.
stopwords_dict = defaultdict(list)
for l in languages:
    for sw in stopwords.words(l):
        stopwords_dict[sw].append(l)


For each sentence (represented as a list of tokens), we compute the number of stop words of each language present in the sentence, using a dictionary to accumulate the counts. Then, we simply predict the sentence to be of the language with the largest count (if the dictionary is not empty; else we predict 'unknown').

In case of a tie, we toss a coin and choose at random among the tied languages.

In [4]:
from collections import defaultdict, Counter
import random

def predict_language_naive(sentence):
    random.seed(0)  # make the random tie-break reproducible
    cnt = Counter()
    cnt.update(language
               for word in sentence
               for language in stopwords_dict.get(word, ()))
    if not cnt:
        return 'unknown'

    m = max(cnt.values())
    return random.choice([k for k, v in cnt.items() if v == m])
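To see the counting at work without the nltk data, here is a minimal, self-contained sketch of the same idea. The tiny `toy_stopwords` table is made up for illustration (it is not the nltk stop-word list), and the helper names are hypothetical. Unlike the function above, it reports ties instead of breaking them at random.

```python
from collections import Counter

# Hypothetical, hand-picked stop words; the real lists come from nltk.
toy_stopwords = {
    'the': ['english'],
    'and': ['english'],
    'le': ['french'],
    'et': ['french'],
    'de': ['french', 'portuguese'],
    'o': ['portuguese'],
}

def count_languages(tokens):
    """Count, per language, how many tokens are stop words of that language."""
    cnt = Counter()
    for token in tokens:
        cnt.update(toy_stopwords.get(token, ()))
    return cnt

def best_languages(tokens):
    """Return every language tied for the highest count (empty list if none)."""
    cnt = count_languages(tokens)
    if not cnt:
        return []
    top = max(cnt.values())
    return sorted(lang for lang, n in cnt.items() if n == top)

print(best_languages(['le', 'chat', 'et', 'le', 'chien']))  # ['french']
print(best_languages(['de']))                               # ['french', 'portuguese']
print(best_languages(['xyz']))                              # []
```

The second call shows why a tie-breaking rule is needed: a stop word shared by two languages gives each of them one vote.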


We can compute the accuracy as follows:

In [5]:
def compute_accuracy(predictor):
    return (sentences['sentences'].apply(predictor) == sentences['label']).sum() / len(sentences)

In [6]:
compute_accuracy(predict_language_naive)

Out[6]:
0.92565982404692082

As a side note, accuracy might not be the ideal metric here, since the class distribution is unbalanced: English sentences are roughly two to three times as frequent as those of any other language.
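One way to make the imbalance visible is to look at recall per language and to average it (often called balanced accuracy). The sketch below uses only the standard library; the labels in the usage example are made up, and `per_class_recall` and `balanced_accuracy` are illustrative helper names, not part of the notebook.

```python
from collections import Counter

def per_class_recall(y_true, y_pred):
    """Fraction of correctly predicted samples for each true label."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / n for label, n in totals.items()}

def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of the per-class recalls."""
    recalls = per_class_recall(y_true, y_pred)
    return sum(recalls.values()) / len(recalls)

# Made-up example: English dominates, so plain accuracy looks better
# than the behaviour on the minority class really is.
y_true = ['english'] * 6 + ['french'] * 2
y_pred = ['english'] * 6 + ['english', 'french']

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                            # 0.875
print(per_class_recall(y_true, y_pred))    # {'english': 1.0, 'french': 0.5}
print(balanced_accuracy(y_true, y_pred))   # 0.75
```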

Execution time is computed using the timeit magic.

In [7]:
%timeit compute_accuracy(predict_language_naive)

299 ms ± 24.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


This solution is quite fast, but not very accurate. It needs no external library beyond nltk, which might be an advantage in some contexts.
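A note on the timing itself: `%timeit` is only available in IPython and Jupyter. Outside a notebook, a comparable measurement can be sketched with the standard `timeit` module. Here `time_callable` is a hypothetical helper, and the lambda is a stand-in workload for `compute_accuracy(predict_language_naive)`.

```python
import statistics
import timeit

def time_callable(func, repeat=7, number=1):
    """Return (mean, stdev) of the per-call time in seconds over `repeat` runs."""
    runs = timeit.repeat(func, repeat=repeat, number=number)
    per_call = [t / number for t in runs]
    return statistics.mean(per_call), statistics.stdev(per_call)

# Stand-in workload; replace it with the function you actually want to time.
mean, std = time_callable(lambda: sum(range(100_000)))
print('{:.3f} ms ± {:.3f} ms per loop'.format(mean * 1e3, std * 1e3))
```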

## External libraries

Now that we have a baseline, we can benchmark a few external libraries to see how well they perform. They will probably be more accurate, but at what cost in terms of execution time?

Two libraries have been tested, langdetect and pycld2.

### langdetect

The official documentation can be found here. It is a Python port of a Google library. Unfortunately, the code is not very Pythonic...

It's easily installed with pip.

In [8]:
from langdetect import detect, lang_detect_exception


The langdetect API takes whole sentences (not tokenised) as input, so we first concatenate tokenised sentences.

The detect function may also raise an exception when it is unsure about the language, in which case we want to return an unknown label, so our wrapper should catch the exception.

Finally, the output is the ISO 639-1 code of the language, which is not very user-friendly. We use a mapping dictionary to convert it to a language name.

In [9]:
iso_to_human = {'aa': 'afar', 'ab': 'abkhazian', 'af': 'afrikaans', 'ak': 'akan', 'am': 'amharic', 'an': 'aragonese', 'ar': 'arabic', 'as': 'assamese', 'av': 'avar', 'ay': 'aymara', 'az': 'azerbaijani', 'ba': 'bashkir', 'be': 'belarusian', 'bg': 'bulgarian', 'bh': 'bihari', 'bi': 'bislama', 'bm': 'bambara', 'bn': 'bengali', 'bo': 'tibetan', 'br': 'breton', 'bs': 'bosnian', 'ca': 'catalan', 'ce': 'chechen', 'ch': 'chamorro', 'co': 'corsican', 'cr': 'cree', 'cs': 'czech', 'cu': 'old bulgarian', 'cv': 'chuvash', 'cy': 'welsh', 'da': 'danish', 'de': 'german', 'dv': 'divehi', 'dz': 'dzongkha', 'ee': 'ewe', 'el': 'greek', 'en': 'english', 'eo': 'esperanto', 'es': 'spanish', 'et': 'estonian', 'eu': 'basque', 'fa': 'persian', 'ff': 'peul', 'fi': 'finnish', 'fj': 'fijian', 'fo': 'faroese', 'fr': 'french', 'fy': 'west frisian', 'ga': 'irish', 'gd': 'scottish gaelic', 'gl': 'galician', 'gn': 'guarani', 'gu': 'gujarati', 'gv': 'manx', 'ha': 'hausa', 'he': 'hebrew', 'hi': 'hindi', 'ho': 'hiri motu', 'hr': 'croatian', 'ht': 'haitian', 'hu': 'hungarian', 'hy': 'armenian', 'hz': 'herero', 'ia': 'interlingua', 'id': 'indonesian', 'ie': 'interlingue', 'ig': 'igbo', 'ii': 'sichuan yi', 'ik': 'inupiak', 'io': 'ido', 'is': 'icelandic', 'it': 'italian', 'iu': 'inuktitut', 'ja': 'japanese', 'jv': 'javanese', 'kg': 'kongo', 'ki': 'kikuyu', 'kj': 'kuanyama', 'kk': 'kazakh', 'kl': 'greenlandic', 'km': 'cambodian', 'kn': 'kannada', 'ko': 'korean', 'kr': 'kanuri', 'ks': 'kashmiri', 'ku': 'kurdish', 'kv': 'komi', 'kw': 'cornish', 'ky': 'kirghiz', 'la': 'latin', 'lb': 'luxembourgish', 'lg': 'ganda', 'li': 'limburgian', 'ln': 'lingala', 'lo': 'laotian', 'lt': 'lithuanian', 'lv': 'latvian', 'mg': 'malagasy', 'mh': 'marshallese', 'mi': 'maori', 'mk': 'macedonian', 'ml': 'malayalam', 'mn': 'mongolian', 'mo': 'moldovan', 'mr': 'marathi', 'ms': 'malay', 'mt': 'maltese', 'my': 'burmese', 'na': 'nauruan', 'nd': 'north ndebele', 'ne': 'nepali', 'ng': 'ndonga', 'nl': 'dutch', 'nn': 'norwegian nynorsk', 
'no': 'norwegian', 'nr': 'south ndebele', 'nv': 'navajo', 'ny': 'chichewa', 'oc': 'occitan', 'oj': 'ojibwa', 'om': 'oromo', 'or': 'oriya', 'os': 'ossetian', 'pa': 'punjabi', 'pi': 'pali', 'pl': 'polish', 'ps': 'pashto', 'pt': 'portuguese', 'qu': 'quechua', 'rm': 'raeto romance', 'rn': 'kirundi', 'ro': 'romanian', 'ru': 'russian', 'rw': 'rwandi', 'sa': 'sanskrit', 'sc': 'sardinian', 'sd': 'sindhi', 'sg': 'sango', 'sh': 'serbo-croatian', 'si': 'sinhalese', 'sk': 'slovak', 'sl': 'slovenian', 'sm': 'samoan', 'sn': 'shona', 'so': 'somalia', 'sq': 'albanian', 'sr': 'serbian', 'ss': 'swati', 'st': 'southern sotho', 'su': 'sundanese', 'sv': 'swedish', 'sw': 'swahili', 'ta': 'tamil', 'te': 'telugu', 'tg': 'tajik', 'th': 'thai', 'ti': 'tigrinya', 'tk': 'turkmen', 'tl': 'tagalog', 'tn': 'tswana', 'to': 'tonga', 'tr': 'turkish', 'ts': 'tsonga', 'tt': 'tatar', 'tw': 'twi', 'ty': 'tahitian', 'ug': 'uyghur', 'ur': 'urdu', 've': 'venda', 'vi': 'vietnamese', 'vo': 'volapük', 'wa': 'walloon', 'wo': 'wolof', 'xh': 'xhosa', 'yi': 'yiddish', 'yo': 'yoruba', 'za': 'zhuang', 'zh': 'chinese', 'zu': 'zulu'}

def detect_without_exception(s):
    try:
        return iso_to_human[detect(' '.join(s))]
    except lang_detect_exception.LangDetectException:
        return 'unknown'
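One caveat: langdetect can return region-qualified codes such as `zh-cn` or `zh-tw`, which are not keys of the mapping above, so a direct lookup would raise a `KeyError` on such text. A more defensive lookup could look like the sketch below, where `code_to_name` is a hypothetical helper and `iso_subset` is a cut-down stand-in for `iso_to_human`.

```python
# Cut-down stand-in for the full iso_to_human mapping.
iso_subset = {'en': 'english', 'fr': 'french', 'zh': 'chinese'}

def code_to_name(code, mapping=iso_subset):
    """Map a language code to a readable name, falling back to 'unknown'."""
    base = code.lower().split('-')[0]  # 'zh-cn' -> 'zh'
    return mapping.get(base, 'unknown')

print(code_to_name('fr'))     # french
print(code_to_name('zh-cn'))  # chinese
print(code_to_name('xx'))     # unknown
```

This does not matter for the genesis corpus, which contains only Western languages, but it avoids a crash on arbitrary input.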


Here we go for the prediction accuracy, and the execution time.

In [10]:
compute_accuracy(detect_without_exception)

Out[10]:
0.96539589442815255
In [11]:
%timeit compute_accuracy(detect_without_exception)

51.3 s ± 966 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


We have improved the classification accuracy, at the expense of being roughly 170 times slower (51.3 s versus 299 ms), which will not be acceptable in most use cases.

### pycld2

pycld2 provides Python bindings around Google's Compact Language Detector 2 (CLD2).

The API exposes more details than langdetect, providing a confidence percentage for each language detected, and since it's a wrapper on a C++ compiled binary, we can hope that it'll be faster.

It's easily installed with pip.

It is the underlying library used by Polyglot, an NLP library offering a wide variety of tools for handling multilingual use cases. Check it out!

Like langdetect, pycld2 takes whole sentences as input, so we again join the tokens of each sentence before calling it.

In [12]:
import pycld2 as cld2

compute_accuracy(lambda s: cld2.detect(' '.join(s), bestEffort=True)[2][0][0].lower())

Out[12]:
0.97375366568914956
In [13]:
%timeit compute_accuracy(lambda s: cld2.detect(' '.join(s), bestEffort=True)[2][0][0].lower())

134 ms ± 776 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


The accuracy is actually slightly better than with langdetect, and it is even faster than our naive solution (134 ms versus 299 ms).

The downside is that the GitHub repository has not been updated since 2015, and the documentation seems out of sync. Furthermore, the computation is not made in Python, which makes it harder to alter the code to suit custom needs.
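Since the documentation is thin, it may help to spell out what the one-liner above indexes into. As far as I understand pycld2, `cld2.detect` returns a tuple `(isReliable, textBytesFound, details)`, where `details` holds up to three `(languageName, languageCode, percent, score)` entries, best guess first; treat that layout as an assumption to check against your installed version. The sketch below works on hand-written tuples of that shape rather than on real library output, and `summarise_cld2` and its `min_percent` threshold are hypothetical.

```python
def summarise_cld2(result, min_percent=50):
    """Turn a cld2.detect-style result into (language, percent) or ('unknown', 0)."""
    is_reliable, _bytes_found, details = result
    name, _code, percent, _score = details[0]  # best guess comes first
    if not is_reliable or percent < min_percent:
        return 'unknown', 0
    return name.lower(), percent

# Hand-written tuples shaped like the library's return value (not real output).
confident = (True, 42, (('FRENCH', 'fr', 97, 1200.0),
                        ('Unknown', 'un', 0, 0.0),
                        ('Unknown', 'un', 0, 0.0)))
doubtful = (False, 5, (('Unknown', 'un', 0, 0.0),) * 3)

print(summarise_cld2(confident))  # ('french', 97)
print(summarise_cld2(doubtful))   # ('unknown', 0)
```

Using the reliability flag and the percentage lets a caller fall back to an unknown label instead of always trusting the first entry.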

One last thing we can try is to bias the algorithm towards choosing English more often, given that it is the most frequent language.

In [14]:
compute_accuracy(lambda s: cld2.detect(' '.join(s), bestEffort=True, hintLanguage='ENGLISH')[2][0][0].lower())

Out[14]:
0.96796187683284463

Here, it does not improve accuracy (it even drops slightly, from 97.4% to 96.8%), maybe because we have such short pieces of text to label, but it might be of use in other contexts.

## Improvements on the naive solution

Can we beat the 97% accuracy of an off-the-shelf solution? Let's try to improve our naive solution.

### Training dataset

In order to improve our naive solution, we will need another source of multilingual text - using the genesis corpus would be cheating since it's our test set.

We use the European Parliament Proceedings Parallel Corpus which we can download with nltk.

In [15]:
from nltk.corpus import europarl_raw


We can obtain the list of words for each language as follows:

In [16]:
europarl_raw.english.words()

Out[16]:
['Resumption', 'of', 'the', 'session', 'I', 'declare', ...]

We define the list of languages for which we have data:

In [17]:
languages = ['danish', 'dutch', 'english', 'finnish', 'french', 'german', 'italian', 'portuguese', 'spanish', 'swedish']


We also define a small function to help us clean our lists of tokens.

In [18]:
def clean_tokens(tokens):
    return [token.lower() for token in tokens if token.isalpha()]


### Weight stop words

We can observe that some stop words are present in more than one language. Such words are less discriminating with respect to the languages they belong to, so we assign each stop word a weight inversely proportional to the number of languages in which it appears.

In [19]:
weighted_stopwords_dict = defaultdict(dict)
for sword, langs in stopwords_dict.items():
    coeff = 1 / len(langs)
    for lang in langs:
        weighted_stopwords_dict[sword][lang] = coeff

In [20]:
def predict_language_weighted_stopwords(sentence):
    random.seed(0)
    cnt = Counter()
    for word in sentence:
        if word in weighted_stopwords_dict:
            cnt.update(weighted_stopwords_dict[word])

    if not cnt:
        return 'unknown'
    m = max(cnt.values())
    return random.choice([k for k, v in cnt.items() if v == m])

In [21]:
compute_accuracy(predict_language_weighted_stopwords)

Out[21]:
0.92184750733137832
In [22]:
%timeit compute_accuracy(predict_language_weighted_stopwords)

413 ms ± 47.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


Unfortunately, this weighting scheme does not improve our naive solution.

### Use diacritics

Diacritics are, as defined by Wikipedia, glyphs added to a letter. They can be quite distinctive of a given language (if present), and so we want to use them in addition to stopwords to improve our classification accuracy for western languages.

First, we need to determine a list of diacritics used per language. We will use the European Parliament Proceedings to do so.

In the first line of the function, we get a list of all characters present in the proceedings for a given language, after cleaning the tokens (we keep only alphabetic words and cast everything to lower case).

Then we count the number of occurrences of each character. We remove characters occurring 500 times or fewer, since they can come from foreign words such as surnames or place names, and we only want to keep the typical diacritics of a language.

In a last step, we remove unaccented characters (ASCII letters) from the set.

In [23]:
import string

def get_diacritics(language):
    char_list = list(''.join(clean_tokens(getattr(europarl_raw, language).words())))
    cnt = Counter(char_list)
    frequent_chars = {k for k, v in cnt.items() if v > 500}
    return frequent_chars - set(string.ascii_lowercase)
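The same three steps (clean, count, filter) can be tried on a handful of tokens without downloading the corpus. This is only a sketch under toy assumptions: `frequent_non_ascii` is a hypothetical helper, the tokens are made up, and the threshold is lowered to 1 to suit the tiny sample.

```python
import string
from collections import Counter

def frequent_non_ascii(tokens, min_count):
    """Lower-cased characters of alphabetic tokens seen more than
    `min_count` times, with the plain letters a-z removed."""
    chars = Counter(ch for token in tokens if token.isalpha()
                    for ch in token.lower())
    frequent = {ch for ch, n in chars.items() if n > min_count}
    return frequent - set(string.ascii_lowercase)

# Made-up French-like tokens; 'ø' and 'à' appear only once and are
# filtered out, and '2018' is dropped because it is not alphabetic.
tokens = ['été', 'éléphant', 'déjà', 'Søren', '2018', 'café']
print(sorted(frequent_non_ascii(tokens, min_count=1)))  # ['é']
```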


Let's print the list of diacritics per language.

In [24]:
diacritics = {language: list(get_diacritics(language)) for language in languages}
diacritics

Out[24]:
{'danish': ['æ', 'å', 'ø', 'é'],
'dutch': ['ë', 'é'],
'english': [],
'finnish': ['ö', 'ä'],
'french': ['à', 'û', 'ô', 'ê', 'è', 'ç', 'é', 'î'],
'german': ['ö', 'ü', 'ä', 'ß'],
'italian': ['à', 'ò', 'ù', 'è', 'ì', 'é'],
'portuguese': ['à', 'ú', 'ê', 'ã', 'ç', 'á', 'é', 'í', 'ó', 'õ', 'â'],
'spanish': ['ú', 'ñ', 'á', 'é', 'í', 'ó'],
'swedish': ['ö', 'å', 'ä']}

The lists seem about right (at least for the languages I know), and it's running reasonably fast for a naive solution.

Now that we have a list of diacritics, we can use the same method as for stop words to detect the language.

At first, let's try to only use diacritics.

In [25]:
diacritics_transposed = defaultdict(list)
for language, chars in diacritics.items():
    for char in chars:
        diacritics_transposed[char].append(language)

def predict_language_diacritics(sentence):
    cnt = Counter()
    cnt.update(language
               for ch in ''.join(sentence).lower()
               for language in diacritics_transposed[ch]
               if ch not in string.ascii_lowercase)
    if not cnt:
        return 'english'
    m = max(cnt.values())
    return random.choice([k for k, v in cnt.items() if v == m])

In [26]:
compute_accuracy(predict_language_diacritics)

Out[26]:
0.65058651026392966
In [27]:
%timeit compute_accuracy(predict_language_diacritics)

169 ms ± 5.79 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


On such small chunks of text, we are far from guaranteed to have diacritics, which could explain the low accuracy.

Let's check the confusion matrix to see if our hypothesis is right.

We use the pandas-ml library, which combines the power of scikit-learn with the readability of pandas.

In [28]:
from pandas_ml import ConfusionMatrix
ConfusionMatrix(sentences['label'], sentences['sentences'].apply(predict_language_diacritics))

/home/sdg/miniconda3/envs/dev/lib/python3.5/site-packages/pandas_ml/confusion_matrix/abstract.py:66: FutureWarning:
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
df = df.loc[idx, idx.copy()].fillna(0)  # if some columns or rows are missing

Out[28]:
Predicted   danish  dutch  english  finnish  french  german  italian  portuguese  spanish  swedish  __all__
Actual
danish           0      0        0        0       0       0        0           0        0        0        0
dutch            0      0        0        0       0       0        0           0        0        0        0
english          0      0     4521        0       0       0        0           0        0        0     4521
finnish          0      0      227      648       0     671        0           0        0      614     2160
french          66    115      295        0     646       0      462         339       81        0     2004
german           0     10      876      152       0     687        0           0        0      175     1900
italian          0      0        0        0       0       0        0           0        0        0        0
portuguese      12     10      198        0      77       1       18        1213      140        0     1669
spanish          0      0        0        0       0       0        0           0        0        0        0
swedish         43      1       35       95       1      89        1           0        2     1119     1386
__all__        121    136     6152      895     724    1448      481        1552      223     1908    13640

The confusion matrix gives us two very interesting pieces of information.

First, a lot of sentences are predicted as English; actually, any sentence with no diacritics will be predicted as English, as no diacritics were found for English. In short sentences, whatever the language, it is quite possible that no diacritics appear.

Secondly, we can observe that, for example, a large number of Finnish sentences are predicted as Swedish (614) or German (671). That can be explained by the fact that both Finnish diacritics (ä and ö) are also used in Swedish and German, and by the fact that our naive implementation returns a language at random among the most probable ones in case of a tie.
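The tie can be reproduced with a few lines of standard-library Python. The sketch below reuses three of the diacritic lists from Out[24]; `diacritic_votes` is an illustrative helper, not the notebook's predictor.

```python
from collections import Counter

# Subset of the diacritics found above (Out[24]).
diacritics = {
    'finnish': ['ö', 'ä'],
    'german': ['ö', 'ü', 'ä', 'ß'],
    'swedish': ['ö', 'å', 'ä'],
}

def diacritic_votes(text):
    """Count one vote per language for every diacritic occurrence in `text`."""
    votes = Counter()
    for ch in text.lower():
        for language, chars in diacritics.items():
            if ch in chars:
                votes[language] += 1
    return votes

# 'hyvää päivää' (Finnish for 'good day') contains no diacritic other than 'ä'.
print(diacritic_votes('hyvää päivää'))
```

All three languages receive the same number of votes, so a random tie-break picks the right one only a third of the time.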

Let's try now to use the diacritics in addition to the stop words.

In [29]:
def predict_language_stopwords_diacritics(sentence):
    random.seed(0)
    cnt = Counter()
    cnt.update(language
               for word in sentence
               for language in stopwords_dict.get(word, ()))
    cnt.update(language
               for ch in ''.join(sentence).lower()
               for language in diacritics_transposed[ch]
               if ch not in string.ascii_lowercase)
    if not cnt:
        return 'unknown'

    m = max(cnt.values())
    return random.choice([k for k, v in cnt.items() if v == m])

In [30]:
compute_accuracy(predict_language_stopwords_diacritics)

Out[30]:
0.93995601173020527
In [31]:
%timeit compute_accuracy(predict_language_stopwords_diacritics)

463 ms ± 63.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


We do have a gain in accuracy, at the expense of a slightly increased running time.

### Learn a classifier based on n-gram embeddings

We are going to try something a little more sophisticated, using Facebook's fastText library for text classification. To do that, we need a dataset to train our classifier on; we will again use the European Parliament Proceedings corpus.

In [32]:
from pyfasttext import FastText
from sklearn.model_selection import train_test_split
from nltk import ngrams


The fastText library is trained on n-grams (tuples of n words), using a linear classifier on top of a hidden word embedding. Let's create a set of trigrams to learn on.
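For readers unfamiliar with the term, word n-grams are simply sliding windows over the token list. The sketch below shows the idea in plain Python; `word_ngrams` is an illustrative stand-in for `nltk.ngrams`, and the tokens are the first words of the English proceedings shown earlier, lower-cased.

```python
def word_ngrams(tokens, n):
    """Return the list of consecutive n-token tuples in `tokens`."""
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = ['resumption', 'of', 'the', 'session', 'i']
for trigram in word_ngrams(tokens, 3):
    print(' '.join(trigram))
# resumption of the
# of the session
# the session i
```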

In [33]:
doc_set = [(language, clean_tokens(getattr(europarl_raw, language).words())) for language in languages]

trigrams_set = [(language, ' '.join(trigram)) for (language, words) in doc_set
                for trigram in ngrams(words, 3)]

In [34]:
train_set, test_set = train_test_split(trigrams_set, test_size = 0.30, random_state=0)


pyfasttext is a wrapper around a command-line tool, so we need to dump the training set to a file before training the classifier.

In [35]:
with open('train_data_europarl.txt', 'w') as f:
    for label, words in train_set:
        f.write('__label__{} {}\n'.format(label, words))

In [36]:
model = FastText()
model.supervised(input='train_data_europarl.txt', output='model_europarl', epoch=10, lr=0.7, wordNgrams=3)


We can then evaluate the accuracy on the training set and on the test set.

In [37]:
# train accuracy
labels, samples = np.split(np.array(train_set), 2, axis=1)
(np.array(model.predict(samples.T[0])) == labels).sum() / len(train_set)

Out[37]:
0.99680029382291524
In [38]:
# test accuracy
labels, samples = np.split(np.array(test_set), 2, axis=1)
(np.array(model.predict(samples.T[0])) == labels).sum() / len(test_set)

Out[38]:
0.98648199595051833

We can now apply this model to our initial dataset.

In [39]:
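# .values gives a NumPy array, which supports the [:, None] reshape on recent pandas versions
(model.predict(sentences['sentences'].str.join(' ') + '\n') == sentences['label'].values[:, None]).sum()/len(sentences)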

Out[39]:
0.97514662756598236
In [40]:
%timeit model.predict(sentences['sentences'].str.join(' ') + '\n')

204 ms ± 22.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


## Conclusion

We sum up our findings in the following table.