Identifying natural languages from small samples in Python

Question

Using Python, I want to identify French text in a list of short strings (from 1 to about 50 words) which are otherwise in English.

An example of the input data (input strings here are separated by commas):

year of the snake, legendary 'dragon horse', thunder, damsel-fly, larvae of mosquito, 
treillage, libellule, mythical water creature, petites chevrettes, de papillon hideux, 
the horse-fly, 5th earthly branch, dragon, mythical creature, 
a shore plant whose leaves dry a bright orange, dragon horse, god of rain, year of the dragon, 
orthopteran, crocodile, dont le duvet des ailes s'en va en poussière, insecte, dragonfly, 
dracontomelon vitiense, dragon king, petit filet pour une espèce de papillon, sorte d'insecte

Ideally I want to use a library that's already been built, as I'm aware that this is a difficult problem. However, the natural language library in Python I am most familiar with, nltk, does not seem to have the ability to do this, or if it does I haven't found it.

I'm aware that identifying a single word or two is likely to be very difficult, and I'd rather have false negatives (French misidentified as English) than false positives.

sophros · Accepted Answer

There are various approaches to this problem. A more traditional and exact one (though prone to problems with new words) is to take a word list for each language (a French and an English dictionary or thesaurus) and check which one the phrase is found in, either as a full match or by counting how many of its words match.
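A minimal sketch of the word-list idea, assuming you have French and English word lists available. The two tiny sets below are hypothetical stand-ins, and `looks_french` and `min_french_ratio` are names made up for this illustration. Words that appear in both lists, or in neither, never count as French, which leans towards the false negatives the question prefers:

```python
import re
import unicodedata

# Hypothetical, tiny word lists for illustration only; in practice these
# would be loaded from full French and English word-list files.
FRENCH_WORDS = {
    "petit", "petites", "filet", "pour", "une", "espèce", "de",
    "papillon", "hideux", "libellule", "sorte", "insecte", "chevrettes",
}
ENGLISH_WORDS = {
    "year", "of", "the", "snake", "dragon", "horse", "thunder",
    "mythical", "water", "creature", "god", "rain", "king",
}

# Runs of letters only (accented letters included, digits and "_" excluded).
TOKEN_RE = re.compile(r"[^\W\d_]+")


def tokenize(phrase):
    """Lower-case the phrase and split it into alphabetic words."""
    normalized = unicodedata.normalize("NFC", phrase.lower())
    return TOKEN_RE.findall(normalized)


def looks_french(phrase, min_french_ratio=0.6):
    """Return True only if enough words are French and not also English."""
    words = tokenize(phrase)
    if not words:
        return False
    french_only = sum(
        1 for w in words if w in FRENCH_WORDS and w not in ENGLISH_WORDS
    )
    # Unknown words lower the ratio, so out-of-vocabulary phrases are
    # rejected rather than guessed at.
    return french_only / len(words) >= min_french_ratio


for phrase in ["petit filet pour une espèce de papillon", "year of the dragon"]:
    print(phrase, "->", looks_french(phrase))
```

Raising `min_french_ratio` makes the check stricter. With real word lists, the main weakness is the one noted above: anything missing from the French list is silently treated as not French.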

Another one is to use a package for language detection.

Yet another would be to use an ML language model to classify the phrases (e.g. a spaCy lang_detect model).
