Understanding hunspell stemming, why aren't plural and singular stemmed the same?

Question

We are using hunspell in elasticsearch to help us stem irregular nouns, but it doesn't really give us the expected result.

For example, "gulerod" (carrot) and "gulerødder" (carrots) are stemmed to "gulerod" (the word root) and "gulerødder" respectively, instead of both being stemmed to "gulerod".

I have tried stemming the words using https://www.npmjs.com/package/nodehun as well with the same outcome, which leads me to think it is a hunspell/dictionary issue.

I have tried several different da_DK and nb_NO dictionaries, for example from https://stavekontrolden.dk/?dictionaries=1, LibreOffice and Debian, the latter two being various (older) versions of the first.

A small test case:

    const {Nodehun} = require('nodehun');
    const fs = require('fs');

    const affix = fs.readFileSync(
        `./elasticsearch/dictionaries/hunspell/yy_YY/yy_YY.aff`
    );
    const dictionary = fs.readFileSync(
        `./elasticsearch/dictionaries/hunspell/yy_YY/yy_YY.dic`
    );
    const nodehun = new Nodehun(affix, dictionary);

    const words = [
        'gulerod',
        'gulerødder',
        'mand',
        'mænd',
        'mønster',
        'mønstre'
    ];

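    // `await` is not allowed at the top level of a CommonJS script,
    // so the loop runs inside an async function.
    async function main() {
        for (const word of words) {
            const stems = await nodehun.stem(word);
            console.dir({word, stems});
        }
    }

    main().catch(console.error);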

which outputs

    { word: 'gulerod', stems: [ 'gulerod' ] }
    { word: 'gulerødder', stems: [ 'gulerødder' ] }
    { word: 'mand', stems: [ 'mand', 'mande' ] }
    { word: 'mænd', stems: [ 'mænd' ] }
    { word: 'mønster', stems: [ 'mønster' ] }
    { word: 'mønstre', stems: [ 'mønstre', 'mønster' ] }

As you can see, it handles mønster/mønstre correctly, but in that pair the irregularity does not involve a vowel change. Could that be the issue?

Now the question(s): Is this due to hunspell? Or the dictionary? And is there anything we can do to fix this?

Explanation: It turns out this comes down to how the Danish (and possibly Norwegian and Swedish) dictionaries are constructed. "gulerod" and "gulerødder" are treated as two distinct words; see https://github.com/jeppebundsgaard/stavekontrolden/issues/4
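
Since the dictionary lists these plurals as separate words, one possible workaround is to map the known irregular plurals to their singular form before stemming. The following is only a minimal sketch of that idea, not something the dictionary or hunspell provides: the `irregularPlurals` table and the `normalizeIrregular` helper are hypothetical names, and the two entries are just the examples from the test case above.

```javascript
// Hypothetical, hand-maintained table: irregular plural -> singular form.
// Only the two failing examples from the question are listed here.
const irregularPlurals = new Map([
    ['gulerødder', 'gulerod'],
    ['mænd', 'mand'],
]);

// Returns the singular form for a known irregular plural,
// otherwise returns the word unchanged.
function normalizeIrregular(word) {
    const singular = irregularPlurals.get(word.toLowerCase());
    return singular !== undefined ? singular : word;
}
```

In the test case this would be used as `nodehun.stem(normalizeIrregular(word))`, so that words not in the table still go through the normal hunspell stemming. On the Elasticsearch side, a `stemmer_override` token filter placed before the hunspell filter might serve the same purpose, though I have not verified that combination.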
