Reputation: 7820
I need to implement some sort of stemmer/lemmatizer. I have some words in different forms (a few thousand). It's not a full morphological dictionary, just a small part of one. Is it a good idea to learn a stemmer automatically from the file I have? Are there any open-source implementations that can be used?
Upvotes: 1
Views: 444
Reputation: 4213
Azerbaijani is an agglutinative language, similar to Turkish, which means words frequently carry a chain of suffixes (e.g. one suffix for plural and one for accusative). It also has vowel harmony, which means each suffix has several variants and you choose the correct one based on the vowels in the root.
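To make the two properties concrete, here is a minimal sketch of suffix chaining with vowel harmony. It is only an illustration under simplifying assumptions: input is lowercase, the word ends in a consonant before the accusative is attached, and only the plural (-lar/-lər) and accusative (-ı/-i/-u/-ü) are handled. The function names are made up for this example.

```python
# Vowel classes used for harmony (lowercase Azerbaijani Latin alphabet).
BACK_VOWELS = set("aıou")
FRONT_VOWELS = set("eəiöü")
ROUNDED_VOWELS = set("ouöü")


def last_vowel(word):
    """Return the last vowel of the word, which drives suffix harmony."""
    for ch in reversed(word):
        if ch in BACK_VOWELS or ch in FRONT_VOWELS:
            return ch
    raise ValueError(f"no vowel found in {word!r}")


def plural(word):
    """Two-way harmony: -lar after a back vowel, -lər after a front vowel."""
    return word + ("lar" if last_vowel(word) in BACK_VOWELS else "lər")


def accusative(word):
    """Four-way harmony, assuming the word ends in a consonant."""
    v = last_vowel(word)
    if v in BACK_VOWELS:
        return word + ("u" if v in ROUNDED_VOWELS else "ı")
    return word + ("ü" if v in ROUNDED_VOWELS else "i")


# Suffixes chain: root + plural + accusative.
print(accusative(plural("kitab")))  # kitabları
print(accusative(plural("ev")))     # evləri
```

A stemmer has to undo this process, so every suffix variant (and every chain of variants) needs to be recognized and stripped.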
What I would do:
The resulting stemmer will be noisy, but depending on what you need it for, it might not matter.
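As an illustration of how noisy such a learned stemmer can be, here is one very simple way to learn suffixes from a file of related word forms. This is a sketch of a naive approach, not necessarily the procedure this answer has in mind. It assumes the data can be grouped so that each group holds forms of the same word; `learn_suffixes`, `stem`, and the `min_count` and `min_stem_len` parameters are hypothetical names chosen for the example.

```python
import os
from collections import Counter


def learn_suffixes(groups, min_count=1):
    """Collect suffixes by cutting each form at its group's common prefix."""
    counts = Counter()
    for forms in groups:
        # Treat the longest common prefix of the group as the stem.
        prefix = os.path.commonprefix(forms)
        for form in forms:
            suffix = form[len(prefix):]
            if suffix:
                counts[suffix] += 1
    # Keep only suffixes seen often enough to be trusted.
    return {s for s, c in counts.items() if c >= min_count}


def stem(word, suffixes, min_stem_len=2):
    """Strip the longest known suffix that leaves a long enough stem."""
    for cut in range(min_stem_len, len(word)):
        if word[cut:] in suffixes:
            return word[:cut]
    return word


groups = [
    ["kitab", "kitablar", "kitabları"],
    ["ev", "evlər", "evləri"],
]
suffixes = learn_suffixes(groups)
print(stem("kitabları", suffixes))  # kitab
print(stem("gözlər", suffixes))     # göz
```

Because it strips any matching ending, this will also cut words that merely happen to end in a learned suffix, which is the kind of noise to expect.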
Upvotes: 2
Reputation: 1141
Nuve is an NLP library for Turkic languages. Once the language rules and data are prepared, it can analyze and generate words for any Turkic language, and possibly for any agglutinative language. You can fork it and prepare new orthography and morphology files for Azeri.
https://github.com/hrzafer/nuve
Since I'm the author, I'd be glad to help you with the process.
Upvotes: 2
Reputation: 228
You should look at Linguistica, which was developed by John Goldsmith and his team (@UChicago) for this purpose.
Upvotes: 1
Reputation: 1654
Are you talking about English? If so, please see the question "English lemmatizer databases?". Given the significant number of exceptions, a machine-learning approach without a large dictionary does not seem promising.
Upvotes: 0