Kelvin Lee

Reputation: 405

Detecting language using Stanford NLP

I'm wondering if it is possible to use Stanford CoreNLP to detect which language a sentence is written in? If so, how precise can those algorithms be?

Upvotes: 10

Views: 6324

Answers (2)

alvas

Reputation: 122002

Stanford CoreNLP doesn't have language ID (at least not yet); see http://nlp.stanford.edu/software/corenlp.shtml


There are plenty of other language detection/identification tools. But do take the reported precision with a pinch of salt. It is usually evaluated narrowly, bounded by:

  • a fixed list of languages,
  • test sentences of substantial length,
  • test sentences that are each written in a single language, and
  • a skewed proportion of training to testing instances.

Notable language ID tools include:

An exhaustive list from meta-guide.com, see http://meta-guide.com/software-meta-guide/100-best-github-language-identification/


Noteworthy language identification shared tasks (with training/testing data) include:


Also take a look at:

Upvotes: 11

Nikita Astrakhantsev

Reputation: 4749

Almost certainly there is no language identification in Stanford CoreNLP at the moment. 'Almost', because nonexistence is much harder to prove.

EDIT: Nevertheless, here is some circumstantial evidence:

  1. there is no mention of language identification on the main page, on the CoreNLP page, or in the FAQ (although there is a question 'How do I run CoreNLP on other languages?'), nor in the 2014 paper by CoreNLP's authors;
  2. tools that combine several NLP libraries, including Stanford CoreNLP, use another library for language identification, for example DKPro Core ASL; also, other users discussing language identification and CoreNLP don't mention this capability;
  3. the CoreNLP source contains Language classes, but nothing related to language identification; you can check all 84 occurrences of the word 'language' manually here.

Try TIKA, or TextCat, or Language Detection Library for Java (they report "99% over precision for 53 languages").

In general, quality depends on the size of the input text: if it is long enough (say, at least several words, and not specially chosen), precision can be pretty good, about 95%.
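To give a feel for why input length matters, here is a minimal sketch of the character n-gram idea that tools like TextCat build on. It is not any of those libraries' actual algorithm or API: the training snippets, the `trigrams`/`score`/`detect` names, and the simple "fraction of shared trigrams" score are all assumptions made up for illustration, and the toy profiles are far too small for real use.

```python
from collections import Counter

# Toy training snippets (assumption: real tools train on far larger corpora).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog while the children "
          "are playing in the garden with their friends",
    "de": "der schnelle braune fuchs springt über den faulen hund während "
          "die kinder im garten mit ihren freunden spielen",
    "fr": "le renard brun rapide saute par dessus le chien paresseux pendant "
          "que les enfants jouent dans le jardin avec leurs amis",
}


def trigrams(text):
    """Count character trigrams after lowercasing and normalizing spaces."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


# One trigram profile per language, built from the toy snippets above.
PROFILES = {lang: trigrams(sample) for lang, sample in SAMPLES.items()}


def score(text, profile):
    """Fraction of the text's trigrams that also occur in the profile."""
    grams = trigrams(text)
    total = sum(grams.values())
    if total == 0:
        return 0.0  # nothing to compare, e.g. empty input
    hits = sum(count for gram, count in grams.items() if gram in profile)
    return hits / total


def detect(text):
    """Return (best_language, per_language_scores) for the given text."""
    scores = {lang: score(text, profile) for lang, profile in PROFILES.items()}
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    for sentence in (
        "the children are playing in the garden",
        "die kinder im garten mit ihren freunden",
        "les enfants jouent dans le jardin",
    ):
        best, scores = detect(sentence)
        print(best, scores)
```

A one- or two-word input produces only a handful of trigrams, so the scores of different languages can easily tie or flip; that is the intuition behind the length caveat above.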

Upvotes: 11
