Reputation: 11
I'm using Lucene in Java to index a corpus and extract stemmed wordlists from it. I stem using the EnglishAnalyzer. Then I hand the wordlist to Python to do some things with NLTK. Is there a stemmer in NLTK that is fully compatible with the stemmer used by Lucene's EnglishAnalyzer?
I know I could also use PyLucene to circumvent this, but I would like to minimize dependencies.
Upvotes: 1
Views: 1080
Reputation: 122102
If I'm not mistaken, Lucene ships several stemmers contributed by others (namely Snowball, Egothor, and Stempel). Comparing just the Snowball stemmer with the NLTK Porter stemmer, even the NLTK API documentation suggests that the Snowball stemmer is more reliable; see http://nltk.googlecode.com/svn/trunk/doc/api/nltk.stem.porter-module.html.
Here is a quick comparison for English stemming, using http://snowball.tartarus.org/demo.php and http://text-processing.com/demo/stem/:
Snowball:
cat -> cat
computer -> comput
argues -> argu
NLTK Porter:
cat -> cat
computer -> comput
argue -> argu
So, judging from the demos, the two seem to behave pretty much the same. To be sure, though, I would stick with Snowball and keep coding in Java, because the NLTK API documentation suggests so.
P.S. Hi Marc Schuler (I'm the crazy Asian who pronounces your name without the "d").
Upvotes: 0
Reputation: 4182
You can try out the various NLTK stemmers at http://text-processing.com/demo/stem/ and compare the results with the output of Lucene's EnglishAnalyzer. Chances are it implements one of the common algorithms, either Porter or Lancaster.
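If you would rather not check words by hand in the demo, the comparison can be scripted. The following is only a rough sketch: it assumes you have exported word -> stem pairs from the Java side into a dict, and it takes the Python stemmer as a plain callable. The names `find_mismatches`, `lucene_stems`, and `toy_stem` are made up for illustration, and `toy_stem` is just a stand-in so the snippet runs without NLTK installed.

```python
from typing import Callable, Dict, List, Tuple


def find_mismatches(
    lucene_stems: Dict[str, str],
    stem: Callable[[str], str],
) -> List[Tuple[str, str, str]]:
    """Return (word, lucene_stem, python_stem) for each word the stemmers disagree on."""
    mismatches = []
    for word, expected in sorted(lucene_stems.items()):
        actual = stem(word)
        if actual != expected:
            mismatches.append((word, expected, actual))
    return mismatches


# Hypothetical sample of word -> stem pairs exported from the Lucene side.
lucene_stems = {"cat": "cat", "computer": "comput", "argues": "argu"}


# Stand-in stemmer (an assumption, not a real algorithm): strips a trailing "s".
# In practice, pass something like nltk.stem.PorterStemmer().stem instead.
def toy_stem(word: str) -> str:
    return word[:-1] if word.endswith("s") else word


for word, expected, actual in find_mismatches(lucene_stems, toy_stem):
    print(f"{word}: Lucene={expected!r} Python={actual!r}")
```

An empty result over a large enough sample of your corpus would be reasonable evidence that the two stemmers agree for your data, though not a proof of full compatibility. NLTK's PorterStemmer also accepts a `mode` argument, so it may be worth running the check once per mode.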
Upvotes: 1