P_M

Reputation: 2942

Apache Tika fails to detect the language of a short sentence. Why?

I tried to detect the language of a short phrase and was surprised that the detection result was wrong.

    LanguageDetector detector = new OptimaizeLangDetector();
    try {
        detector.loadModels();
    } catch (IOException e) {
        LOG.error(e.getMessage(), e);
        throw new ExceptionInInitializerError(e);
    }
    LanguageResult languageResult = detector.detect("Hello, my friend!");

The languageResult contains Norwegian with "medium" probability. Why? I think it should be English instead. Longer phrases seem to be detected properly. Does this mean that Apache Tika should not be used on short text?

Upvotes: 2

Views: 1426

Answers (2)

ka3ak

Reputation: 3191

I could reproduce the issue. This may not directly answer the question, but it can be considered a workaround.

It seems that if you know which languages to expect, you can pass them to the detector via the loadModels(models) method. This approach helps to detect English correctly:

        try {
            Set<String> models=new HashSet<>();
            models.add("en");
            models.add("ru");
            models.add("de");
            LanguageDetector detector = new OptimaizeLangDetector()
//            .setShortText(true)
            .loadModels(models);
//            .loadModels();
            LanguageResult enResult = detector.detect("Hello, my friend!");
//            LanguageResult ruResult = detector.detect("Привет, мой друг!");
//            LanguageResult deResult = detector.detect("Hallo, mein Freund!");
            System.out.println(enResult.getLanguage());
        } catch (IOException e) {
            throw new ExceptionInInitializerError(e);
        }

Upvotes: 1

Gatusko

Reputation: 2608

This will not work well on short text. As the documentation says:

Implementation of the LanguageDetector API that uses https://github.com/optimaize/language-detector

From https://tika.apache.org/1.13/api/org/apache/tika/langdetect/OptimaizeLangDetector.html

Reviewing that GitHub project and checking its Challenges section shows that it has issues with short texts:

This software does not work as well when the input text to analyze is short, or unclean. For example tweets.

From the Challenges section of https://github.com/optimaize/language-detector
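To give a rough feeling for why short input is hard, here is a minimal sketch, assuming the detector compares character n-grams of the input against per-language profiles (which is how I understand the Optimaize README). This is a toy illustration only, not the Optimaize or Tika implementation: the class TrigramSketch and its methods are hypothetical names made up for this example. It only counts how many trigram observations a text provides as evidence.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Toy illustration only: NOT the Optimaize or Tika implementation.
// TrigramSketch and its methods are hypothetical names for this example.
final class TrigramSketch {

    // Count overlapping character trigrams after lowercasing and
    // replacing every run of non-letters with a single space.
    static Map<String, Integer> trigrams(String text) {
        String cleaned = text.toLowerCase(Locale.ROOT)
                .replaceAll("[^\\p{L}]+", " ")
                .trim();
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 3 <= cleaned.length(); i++) {
            counts.merge(cleaned.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    // Total number of trigram observations, i.e. the amount of
    // evidence a profile-based detector would have to work with.
    static int evidence(String text) {
        int total = 0;
        for (int count : trigrams(text).values()) {
            total += count;
        }
        return total;
    }

    public static void main(String[] args) {
        String shortText = "Hello, my friend!";
        String longText = "Hello, my friend! It has been a long time since we last talked.";
        System.out.println("short: " + evidence(shortText) + " trigrams");
        System.out.println("long:  " + evidence(longText) + " trigrams");
    }
}
```

With only a handful of trigrams, a few that happen to be common in another language can be enough to tip the result, which would also explain why restricting the candidate languages helps.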

Upvotes: 2
