lizarisk

Reputation: 7820

Is there an open-source self-learning stemmer?

I need to implement some sort of stemmer/lemmatizer. I have a few thousand words in different forms. It's not a full morphological dictionary, just a small part of one. Is it a good idea to learn a stemmer automatically from the file I have? Are there any open-source implementations that can be used?

Upvotes: 1

Views: 444

Answers (4)

Jirka

Reputation: 4213

Azerbaijani is an agglutinative language, similar to Turkish, which means words frequently carry a chain of suffixes (e.g. one suffix for plural and one for accusative). It also has vowel harmony, which means each suffix has several variants and you choose the correct one based on the vowels in the root.

What I would do:

  • Identify a list of suffixes. I would try both unsupervised methods (Linguistica might be worth a try) and googling for a list of suffixes (such lists often contain only a basic form of each suffix, which changes depending on vowel harmony). Iterating, you should arrive at some reasonable list. If in doubt whether something is a suffix or not, I would throw it in.
  • Use the list to strip suffixes from words.

The resulting stemmer will be noisy, but depending on what you need it for, it might not matter.
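
The two steps above can be sketched roughly as follows. This is only a minimal illustration, not a tested Azerbaijani stemmer: the suffix templates (`lAr`, `I`, `dA`), the `HARMONY` table, and the names `expand` and `strip_suffixes` are all assumptions made up for this example, and a real list would be much longer. Input is assumed to be already lowercased (Python's `str.lower()` does not handle dotted/dotless I the way Turkic languages need).

```python
# Hypothetical harmony placeholders: "A" stands for a/ə, "I" for ı/i/u/ü.
HARMONY = {"A": "aə", "I": "ıiuü"}

# Illustrative, incomplete templates: plural, accusative, locative.
TEMPLATES = ["lAr", "I", "dA"]


def expand(template):
    """Expand one suffix template into all of its vowel-harmony variants."""
    forms = [""]
    for ch in template:
        options = HARMONY.get(ch, ch)  # non-placeholder characters map to themselves
        forms = [f + o for f in forms for o in options]
    return forms


SUFFIXES = [form for t in TEMPLATES for form in expand(t)]


def strip_suffixes(word, suffixes=SUFFIXES, min_stem_len=2):
    """Repeatedly strip the longest matching suffix, keeping a minimum stem length."""
    ordered = sorted(suffixes, key=len, reverse=True)
    stem = word
    stripped = True
    while stripped:
        stripped = False
        for suffix in ordered:
            if stem.endswith(suffix) and len(stem) - len(suffix) >= min_stem_len:
                stem = stem[: -len(suffix)]
                stripped = True
                break
    return stem


for w in ["kitablar", "kitabları", "evlərdə"]:
    print(w, "->", strip_suffixes(w))
```

The loop is what handles suffix chains: each pass removes one suffix and then tries again on what is left. The `min_stem_len` guard is a crude way to limit over-stripping; any root that happens to end in something that looks like a suffix will still be cut, which is where the noise comes from.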

Upvotes: 2

hrzafer

Reputation: 1141

Nuve is an NLP library for Turkic languages. Once the language rules and data are prepared, it can analyze and generate words for any Turkic language, if not for any agglutinative language. You can fork it and prepare new orthography and morphology files for Azeri.

https://github.com/hrzafer/nuve

Since I'm the author, I'd be glad to help you with the process.

Upvotes: 2

GAM PUB

Reputation: 228

You should look at Linguistica, which was developed by John Goldsmith and his team (@UChicago) for this purpose.

Upvotes: 1

Daniel Naber

Reputation: 1654

Are you talking about English? Then please see the question "English lemmatizer databases?". Considering the significant number of exceptions, a machine-learning approach without a large dictionary does not seem promising.

Upvotes: 0
