Marc Schulder

Reputation: 11

Compatibility of Stemmers between NLTK and Lucene

I'm using Lucene in Java to index a corpus and extract stemmed wordlists from it. I stem using the EnglishAnalyzer. Then I hand the wordlist to Python to do some things with NLTK. Is there a stemmer in NLTK that is fully compatible with the stemmer used by Lucene's EnglishAnalyzer?

I know I could also use PyLucene to circumvent this, but I would like to minimize dependencies.

Upvotes: 1

Views: 1080

Answers (2)

alvas

Reputation: 122102

If I'm not mistaken, Lucene ships several stemmers contributed by others (viz. Snowball, Egothor, Stempel). Considering just the Snowball stemmer versus the NLTK Porter stemmer, even the NLTK API documentation suggests that the Snowball stemmer is more reliable; see http://nltk.googlecode.com/svn/trunk/doc/api/nltk.stem.porter-module.html.

Here are a few comparisons for English stemming (using http://snowball.tartarus.org/demo.php and http://text-processing.com/demo/stem/):

Snowball:

cat -> cat
computer -> comput
argues -> argu

NLTK Porter:

cat -> cat
computer -> comput
argue -> argu

So from the demos, the two seem to be pretty much the same, but to be sure, I would stick with Snowball and keep coding in Java, because the NLTK API documentation suggests so.
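
If you would rather not depend on the web demos, the same comparison can be run locally. This is only a rough sketch: `stem_table` and `disagreements` are helper names I made up for illustration, and it assumes NLTK is installed and that `PorterStemmer` and `SnowballStemmer("english")` are the two stemmers you want to compare. It says nothing about what Lucene itself does.

```python
def stem_table(words, stemmers):
    """Return {word: {stemmer_name: stem}} for every word in `words`.

    `stemmers` maps a label to any callable that takes a word and
    returns its stem (hypothetical helper, not part of NLTK).
    """
    return {w: {name: stem(w) for name, stem in stemmers.items()} for w in words}


def disagreements(table):
    """Keep only the words for which the stemmers produced different stems."""
    return {w: stems for w, stems in table.items() if len(set(stems.values())) > 1}


def main():
    try:
        from nltk.stem import PorterStemmer, SnowballStemmer
    except ImportError:
        print("NLTK is not installed; skipping the comparison.")
        return

    stemmers = {
        "porter": PorterStemmer().stem,
        "snowball": SnowballStemmer("english").stem,
    }
    # The three words tried in the demos above.
    table = stem_table(["cat", "computer", "argues"], stemmers)
    for word, stems in table.items():
        print(word, stems)
    print("disagreements:", disagreements(table))


if __name__ == "__main__":
    main()
```

Feeding it a larger wordlist than three words should give a better idea of how often the two algorithms actually diverge.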

P.S. Hi Marc Schuler (I'm the crazy Asian who pronounces your name without the "d").

Upvotes: 0

Jacob

Reputation: 4182

You can try out the various NLTK stemmers at http://text-processing.com/demo/stem/ and compare the results with the output of Lucene's EnglishAnalyzer. Chances are it implements one of the common algorithms, either Porter or Lancaster.
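
To make that comparison against your real data, something along these lines might work. It is a sketch under assumptions: it presumes you can export tab-separated `token<TAB>stem` pairs from the Java side (that export format, and the names `read_pairs` and `find_mismatches`, are hypothetical), and it uses NLTK's `PorterStemmer` merely as one candidate to check. The two sample rows are the stems quoted earlier in this thread, not verified Lucene output.

```python
import io


def read_pairs(handle):
    """Yield (token, lucene_stem) from tab-separated lines.

    Lines that do not have exactly two fields are skipped.
    """
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 2:
            yield fields[0], fields[1]


def find_mismatches(pairs, stem):
    """Return (token, lucene_stem, candidate_stem) wherever the stems differ."""
    mismatches = []
    for token, lucene_stem in pairs:
        candidate = stem(token)
        if candidate != lucene_stem:
            mismatches.append((token, lucene_stem, candidate))
    return mismatches


def main():
    try:
        from nltk.stem import PorterStemmer
    except ImportError:
        print("NLTK is not installed; skipping the check.")
        return

    # Stand-in for a file exported from Java; replace with open("your_export.tsv").
    sample = io.StringIO("cat\tcat\ncomputer\tcomput\n")
    for row in find_mismatches(read_pairs(sample), PorterStemmer().stem):
        print(row)


if __name__ == "__main__":
    main()
```

An empty result on a large sample would suggest, though not prove, that the candidate stemmer is compatible. Keep in mind that an analyzer may also lowercase tokens or drop stopwords, so differences are not necessarily caused by the stemmer alone.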

Upvotes: 1
