user3025289

How to find singular in the plural when some letters change? What is the best approach?

How can I find the singular in the plural when some letters change?

Consider the following situation, with the singular Schließfach and the plural Schließfächer:

As you can see, the letter a has changed to ä. For this reason, the first word is no longer a substring of the second one; as far as a regex is concerned, they are different.

Maybe the tags I chose below are off, and regex is not the right tool for this. I've seen that natural (natural.NounInflector()) provides this functionality out of the box for English words. Is there a comparable solution for German?

What is the best approach to finding the singular within the plural in German?

Upvotes: 5

Views: 772

Answers (2)

Jindřich

Reputation: 11213

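You can use a stemmer from the nlp.js library, which has models for 40 languages. A stemmer reduces inflected forms of a word to a common stem, so if the singular and the plural reduce to the same stem you can treat them as the same word.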

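// German stemmer from the nlp.js language package.
const { StemmerDe } = require('@nlpjs/lang-de');

const stemmer = new StemmerDe();
// Print the stem of the singular and of the plural so they can be compared.
console.log(stemmer.stemWord('Schließfach'));
console.log(stemmer.stemWord('Schließfächer'));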

Upvotes: 2

Dan Levy

Reputation: 1263

I once had to build a text processor that parsed many languages, in registers ranging from very casual to very formal. One of the things it had to identify was whether certain words were related (for example, a noun in a title that referred to a list of things, sometimes labeled with a plural form).

IIRC, 70-90% of singular and plural word forms across all the languages we supported had a Levenshtein distance of less than 3 or 4. (We eventually added several dictionaries to improve accuracy, because distance alone produced many false positives.) Another interesting finding: the longer the words, the more likely it was that a distance of 3 or less indicated a relationship in meaning.
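The distance heuristic described above can be sketched without any dependencies. This is only a minimal sketch, not the code from that project: `levenshtein` is a textbook implementation, and `looksRelated`, its `maxDistance` of 3, and its `minLength` of 5 are hypothetical names and values chosen purely for illustration.

```javascript
// Textbook dynamic-programming Levenshtein distance (no dependencies).
function levenshtein(a, b) {
  const s = Array.from(a);
  const t = Array.from(b);
  // prev[j] holds the distance between the processed prefix of s and t[0..j).
  let prev = Array.from({ length: t.length + 1 }, (_, j) => j);
  for (let i = 1; i <= s.length; i++) {
    const curr = [i];
    for (let j = 1; j <= t.length; j++) {
      const cost = s[i - 1] === t[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // deletion
        curr[j - 1] + 1,    // insertion
        prev[j - 1] + cost  // substitution or match
      );
    }
    prev = curr;
  }
  return prev[t.length];
}

// Hypothetical heuristic: accept a pair only when the distance is at most
// maxDistance AND the shorter word has at least minLength characters.
// Both thresholds are assumptions and would need tuning on real data.
function looksRelated(a, b, maxDistance = 3, minLength = 5) {
  const shorter = Math.min(a.length, b.length);
  return shorter >= minLength && levenshtein(a, b) <= maxDistance;
}

console.log(levenshtein('Schließfach', 'Schließfächer'));  // 3
console.log(looksRelated('Schließfach', 'Schließfächer')); // true
console.log(looksRelated('Hund', 'Hand'));                 // false (too short to trust)
console.log(looksRelated('not-it', 'Schiesse'));           // false (distance is 8)
```

The minimum length reflects the observation that a small distance is more trustworthy for longer words. A dictionary lookup would still be needed on top of this to filter out false positives.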

Here's an example of the libraries we used:

const fastLevenshtein = require('fast-levenshtein');

console.log('Raw Distances:');
console.log('Score 1:', fastLevenshtein.get('Schließfächer', 'Schließfach'));
// -> 3
console.log('Score 2:', fastLevenshtein.get('Blumtach', 'Blumtächer'));
// -> 3
console.log('Score 3:', fastLevenshtein.get('schließfächer', 'Schliessfaech'));
// -> 7
console.log('Score 4:', fastLevenshtein.get('not-it', 'Schliessfaech'));
// -> 12
console.log('Score 5:', fastLevenshtein.get('not-it', 'Schiesse'));
// -> 8


/**
 * Additional strategy for dealing with other various languages:
 *   "Deburr" the strings to omit diacritics before checking the distance:
 */

const deburr = require('lodash.deburr');
console.log('Deburred Distances:');
// Deburr each string first, then measure the distance between the results.
console.log('Score 1:', fastLevenshtein.get(deburr('Schließfächer'), deburr('Schließfach')));
// -> 2
console.log('Score 2:', fastLevenshtein.get(deburr('Blumtach'), deburr('Blumtächer')));
// -> 2
console.log('Score 3:', fastLevenshtein.get(deburr('schließfächer'), deburr('Schliessfaech')));
// -> 4

// Lower than the raw distances, because ä -> a and ß -> ss no longer count as edits.
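If pulling in lodash is not an option, the same folding idea can be approximated with built-in string methods. The sketch below is an assumption-laden illustration, not part of the libraries above: `fold` is a hypothetical helper name, and it also lowercases the input, which `deburr` does not do.

```javascript
// Hypothetical helper: fold diacritics using only built-in string methods.
function fold(word) {
  return word
    .toLowerCase()                    // also maps a capital ẞ to ß
    .replace(/ß/g, 'ss')              // ß has no Unicode decomposition, so map it by hand
    .normalize('NFD')                 // split ä into a + combining diaeresis
    .replace(/[\u0300-\u036f]/g, ''); // drop the combining marks
}

const singular = fold('Schließfach');
const plural = fold('Schließfächer');

console.log(singular);                    // schliessfach
console.log(plural);                      // schliessfacher
console.log(plural.startsWith(singular)); // true
```

After folding, the singular is a prefix of the plural again for this pair. That only covers plurals formed by an umlaut plus a suffix, so it is a complement to the distance check, not a replacement for it.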

Upvotes: 7
