Reputation: 2942
I tried to detect the language of a short phrase and was surprised that the detection result is wrong.
LanguageDetector detector = new OptimaizeLangDetector();
try {
detector.loadModels();
} catch (IOException e) {
LOG.error(e.getMessage(), e);
throw new ExceptionInInitializerError(e);
}
LanguageResult languageResult = detector.detect("Hello, my friend!");
The languageResult contains Norwegian with "medium" probability. Why? I think it should be English instead. Longer phrases seem to be detected properly. Does this mean that Apache Tika should not be used on short text?
Upvotes: 2
Views: 1426
Reputation: 3191
I could reproduce the issue. This may not directly answer the question, but it can be considered a workaround...
It seems that if you know which languages to expect, you can pass them to the detector via the loadModels(models)
method. This approach helps to detect English correctly:
try {
Set<String> models=new HashSet<>();
models.add("en");
models.add("ru");
models.add("de");
LanguageDetector detector = new OptimaizeLangDetector()
// .setShortText(true)
.loadModels(models);
// .loadModels();
LanguageResult enResult = detector.detect("Hello, my friend!");
// LanguageResult ruResult = detector.detect("Привет, мой друг!");
// LanguageResult deResult = detector.detect("Hallo, mein Freund!");
System.out.println(enResult.getLanguage());
} catch (IOException e) {
throw new ExceptionInInitializerError(e);
}
Upvotes: 1
Reputation: 2608
This will not work well on short text. As the documentation says:
Implementation of the LanguageDetector API that uses https://github.com/optimaize/language-detector
From https://tika.apache.org/1.13/api/org/apache/tika/langdetect/OptimaizeLangDetector.html
Reviewing that GitHub project, its Challenges section mentions issues with short texts:
This software does not work as well when the input text to analyze is short, or unclean. For example tweets.
From the Challenges section of https://github.com/optimaize/language-detector
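To get a feel for why short input is hard, here is a toy sketch, assuming the detector relies on character n-gram statistics as detectors of this kind typically do. It is not the optimaize algorithm; the class name TrigramCount and the method distinctTrigrams are hypothetical and exist only for illustration. It simply counts how many distinct character trigrams a text offers as evidence.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration only; not part of Tika or optimaize.
public class TrigramCount {

    // Collects the distinct character trigrams of the lower-cased text.
    static Set<String> distinctTrigrams(String text) {
        String normalized = text.toLowerCase(Locale.ROOT);
        Set<String> trigrams = new LinkedHashSet<>();
        for (int i = 0; i + 3 <= normalized.length(); i++) {
            trigrams.add(normalized.substring(i, i + 3));
        }
        return trigrams;
    }

    public static void main(String[] args) {
        String shortText = "Hello, my friend!";
        String longerText = "Hello, my friend! It has been a long time since we last talked.";

        // The short phrase yields far fewer trigrams to compare against language profiles.
        System.out.println("short:  " + distinctTrigrams(shortText).size());
        System.out.println("longer: " + distinctTrigrams(longerText).size());
    }
}
```

With so few n-grams available, a handful that happen to be common in another language can tip the result, which is consistent with restricting the candidate languages helping in the other answer.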
Upvotes: 2