Apache Tika fails to detect the language of a short sentence. Why?

Question

I tried to detect the language of a short phrase and was surprised that the detection result was wrong.

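    import java.io.IOException;

    import org.apache.tika.langdetect.OptimaizeLangDetector;
    import org.apache.tika.language.detect.LanguageDetector;
    import org.apache.tika.language.detect.LanguageResult;

    // Load the built-in language profiles before detecting anything.
    LanguageDetector detector = new OptimaizeLangDetector();
    try {
        detector.loadModels();
    } catch (IOException e) {
        LOG.error(e.getMessage(), e);
        throw new ExceptionInInitializerError(e);
    }
    // Detect the language of a short phrase.
    LanguageResult languageResult = detector.detect("Hello, my friend!");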

The languageResult contains Norwegian with "medium" probability. Why? I think it should be English instead. Longer phrases seem to be detected properly. Does this mean that Apache Tika should not be used on short text?

Gatusko · Accepted Answer

This will not work well on short text. As the documentation says:

Implementation of the LanguageDetector API that uses https://github.com/optimaize/language-detector

From https://tika.apache.org/1.13/api/org/apache/tika/langdetect/OptimaizeLangDetector.html

If you look at that GitHub project and read its Challenges section, you will see that it has known issues with short texts:

This software does not work as well when the input text to analyze is short, or unclean. For example tweets.

From the Challenges section of https://github.com/optimaize/language-detector
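
Since detection is unreliable on short input, one pragmatic option is to skip detection (or distrust its result) when the text is too short. The following is only a minimal sketch and not part of Tika or language-detector: the class `ShortTextGuard`, its method names, and the threshold of 20 letters are hypothetical and chosen purely for illustration, so the cut-off should be tuned against your own data.

```java
// Hypothetical helper, not part of Apache Tika or language-detector.
final class ShortTextGuard {

    // Assumed cut-off for illustration only; tune it against your own data.
    static final int MIN_LETTERS = 20;

    private ShortTextGuard() {
    }

    // Counts Unicode letters, ignoring spaces, digits and punctuation.
    static int countLetters(String text) {
        int letters = 0;
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            if (Character.isLetter(codePoint)) {
                letters++;
            }
            i += Character.charCount(codePoint);
        }
        return letters;
    }

    // Returns true when the text has enough letters to be worth detecting.
    static boolean isLongEnough(String text) {
        return text != null && countLetters(text) >= MIN_LETTERS;
    }

    public static void main(String[] args) {
        String phrase = "Hello, my friend!";
        System.out.println(countLetters(phrase)); // 13
        System.out.println(isLongEnough(phrase)); // false: below the assumed threshold
    }
}
```

You could call `isLongEnough` before `detector.detect(...)` and fall back to a default language or an "unknown" result when it returns false. For text that passes, the returned `LanguageResult` also has `isReasonablyCertain()`, which can serve as a second check.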
