horatio.mars

Reputation: 569

NLP for Java: which toolkit should I use?

I'm working on a project that needs to count the occurrences of every word in a text file. For example, I have a text file like this:

What Silver Lake Looks For in IPO Candidates
3 Companies Crushed by Earnings: Apple, Cirrus Logic, IBM
IBM's Palmisano: How You Get To Be A 100-Year Old Company

Suppose the file contains the three sentences above and I want to count each word's occurrences. Here, "Companies" and "Company" should be treated as the same word, "company" (lowercase), so the total count for "company" is 2.

Is there an NLP toolkit for Java that can tell that two words like "families" and "family" both come from the same word, "family"?

I'll use the per-word counts for Naive Bayes training, so it's very important to get an accurate count for each word.

Upvotes: 1

Views: 1155

Answers (4)

nflacco

Reputation: 5082

What you are doing is called stemming (getting the root word).

As mentioned, LingPipe, GATE, and Lucene/Solr do stemming. Another option is the Stanford Parser. Or you could implement the Porter stemming algorithm yourself.
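
To illustrate the idea of normalizing words before counting them, here is a minimal sketch in plain Java. It is not the Porter algorithm and does not use any of the toolkits named above: the class `WordCounter` and its two suffix rules are hypothetical, made up only for this example, and a real stemmer handles far more cases.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for illustration only; not part of any NLP toolkit.
class WordCounter {

    // Toy normalizer: lowercases the token and strips two common plural endings.
    // A real stemmer (e.g. Porter) applies many more rules than this.
    static String normalize(String token) {
        String w = token.toLowerCase(Locale.ROOT);
        if (w.length() > 4 && w.endsWith("ies")) {
            // companies -> company, families -> family
            return w.substring(0, w.length() - 3) + "y";
        }
        if (w.length() > 3 && w.endsWith("s")
                && !w.endsWith("ss") && !w.endsWith("us") && !w.endsWith("is")) {
            // looks -> look, earnings -> earning
            return w.substring(0, w.length() - 1);
        }
        return w;
    }

    // Splits the text on anything that is not a letter or digit,
    // normalizes each token, and counts the normalized forms.
    // Note: a possessive such as "IBM's" is split into "IBM" and "s".
    static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : text.split("[^\\p{L}\\p{Nd}]+")) {
            if (token.isEmpty()) {
                continue; // split() yields an empty first token if the text starts with a delimiter
            }
            counts.merge(normalize(token), 1, Integer::sum);
        }
        return counts;
    }
}
```

Calling `WordCounter.count(text)` on the contents of the file returns a map from each normalized word to its count. Swapping `normalize` for a call into a library stemmer keeps the counting loop unchanged.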

Upvotes: 0

jtremblay

Reputation: 1

You may also look at GATE: http://gate.ac.uk/

If you want to use words to train a bag-of-words model, you can use TF-IDF values instead of absolute counts.

http://en.wikipedia.org/wiki/Tf%E2%80%93idf
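
As a rough sketch of what a TF-IDF value is, the following plain Java computes one common variant: term frequency as the term's share of the tokens in a document, and inverse document frequency as ln(N / df). Other weightings exist. The class `TfIdf` and its method names are hypothetical, and each document is assumed to be already tokenized and normalized into a list of strings.

```java
import java.util.List;

// Hypothetical helper for illustration only; one common TF-IDF variant.
class TfIdf {

    // Term frequency: occurrences of the term divided by the document length.
    static double tf(List<String> doc, String term) {
        if (doc.isEmpty()) {
            return 0.0;
        }
        long hits = doc.stream().filter(term::equals).count();
        return (double) hits / doc.size();
    }

    // Inverse document frequency: ln(N / df), where N is the number of
    // documents and df is the number of documents containing the term.
    // Returns 0 for a term that appears in no document (an assumption made
    // here to avoid division by zero).
    static double idf(List<List<String>> docs, String term) {
        long df = docs.stream().filter(d -> d.contains(term)).count();
        if (df == 0) {
            return 0.0;
        }
        return Math.log((double) docs.size() / df);
    }

    // TF-IDF weight of the term in one document of the collection.
    static double tfIdf(List<String> doc, List<List<String>> docs, String term) {
        return tf(doc, term) * idf(docs, term);
    }
}
```

With this variant, a word that occurs in every document gets a weight of 0, while a word concentrated in a few documents gets a higher weight.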

Upvotes: 0

daydreamer

Reputation: 92169

You can check out LingPipe too: http://alias-i.com/lingpipe/

Upvotes: 0

Aravind Yarram

Reputation: 80192

Apache Lucene and OpenNLP provide good implementations of stemming algorithms. You can review them and use the one that suits you best. I've been using Lucene for my projects.

Upvotes: 4
