Reputation: 405
I'm wondering if it is possible to use Stanford CoreNLP
to detect which language a sentence is written in? If so, how precise can those algorithms be?
Upvotes: 10
Views: 6324
Reputation: 122002
Stanford CoreNLP doesn't have language ID (at least not yet), see http://nlp.stanford.edu/software/corenlp.shtml
There are plenty of other language detection/identification tools out there. But do take the reported precision with a pinch of salt. It is usually evaluated narrowly, bounded by:
Notable language ID tools include:
An exhaustive list from meta-guide.com, see http://meta-guide.com/software-meta-guide/100-best-github-language-identification/
Noteworthy language identification shared tasks (with training/testing data) include:
Also take a look at:
Upvotes: 11
Reputation: 4749
Almost certainly there is no language identification in Stanford CoreNLP at this moment. I say 'almost' because nonexistence is much harder to prove.
EDIT: Nevertheless, below is some circumstantial evidence:
Language classes, but nothing related to language identification - you can check manually for all 84 occurrences of the word 'language' here

Try TIKA, or TextCat, or Language Detection Library for Java (they report "99% over precision for 53 languages").
In general, quality depends on the size of input text: if it is long enough (say, at least several words and not specially chosen), then precision can be pretty good - about 95%.
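To make the point about input length concrete, here is a minimal sketch of rank-ordered character n-gram profiles, in the style of the Cavnar and Trenkle method that TextCat-like tools are generally described as using. It is not the API of TIKA, TextCat, or the Language Detection Library; the function names (`ngram_profile`, `out_of_place`, `identify`) and the three tiny training sentences are made up for illustration, and real tools train on far larger corpora.

```python
from collections import Counter


def ngram_profile(text, n_max=3, top_k=300):
    """Map each of the top_k most frequent character n-grams (1..n_max) to its rank."""
    # Normalize case and whitespace, and pad so word boundaries become n-grams too.
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    # Most frequent first; ties broken alphabetically so the ranking is deterministic.
    ranked = sorted(counts, key=lambda gram: (-counts[gram], gram))
    return {gram: rank for rank, gram in enumerate(ranked[:top_k])}


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams unknown to the language get a fixed penalty."""
    penalty = len(lang_profile)
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else penalty
        for gram, rank in doc_profile.items()
    )


def identify(text, profiles):
    """Return the language whose profile is closest to the text's profile."""
    doc_profile = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc_profile, profiles[lang]))


# Hypothetical toy training data (assumption: one sentence per language).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and then it runs "
          "through the forest with the other animals",
    "de": "der schnelle braune fuchs springt über den faulen hund und dann "
          "läuft er mit den anderen tieren durch den wald",
    "fr": "le renard brun rapide saute par-dessus le chien paresseux et puis "
          "il court dans la forêt avec les autres animaux",
}
PROFILES = {lang: ngram_profile(sample) for lang, sample in SAMPLES.items()}

if __name__ == "__main__":
    for sentence in ("the dog runs through the forest", "ok"):
        print(repr(sentence), "->", identify(sentence, PROFILES))
```

A one-word input such as "ok" produces only a handful of n-grams, so its distance to every profile is dominated by the missing-n-gram penalty and the winner is close to arbitrary. A full sentence yields dozens of n-grams and a much more stable ranking, which is roughly why the tools above do better on longer text.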
Upvotes: 11