Reputation: 569
I'm working on a project that needs to count the occurrences of every word in a text file. For example, I have a text file like this:
What Silver Lake Looks For in IPO Candidates
3 Companies Crushed by Earnings: Apple, Cirrus Logic, IBM
IBM's Palmisano: How You Get To Be A 100-Year Old Company
Suppose the file contains the 3 sentences shown above and I want to count every word's occurrences. Here, "Companies" and "Company" should be treated as the same word, "company" (lowercase), so the total count for "company" is 2.
Is there any NLP toolkit for Java that can tell that two words like "families" and "family" are actually forms of the same word, "family"?
I'll use the per-word counts for Naive Bayes training, so it's very important to get accurate occurrence counts for each word.
Upvotes: 1
Views: 1155
Reputation: 5082
What you are describing is called stemming (reducing a word to its root form).
As mentioned, LingPipe, GATE and Lucene/Solr all do stemming. Another option is the Stanford parser. Or you could implement the Porter stemming algorithm yourself.
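To give a feel for what implementing it yourself involves, here is a minimal sketch in plain Java that covers only steps 1a (plural endings) and 1c (final "y" to "i") of the Porter algorithm and then counts the resulting stems. It is not the full algorithm, and the class and method names (`StemCountSketch`, `stem`, `countStems`) are made up for this example. Note that Porter stems are not always dictionary words: "Companies" and "Company" both come out as "compani", which is fine for counting because they end up in the same bucket.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Partial Porter-style stemmer (steps 1a and 1c only) plus a stem counter. */
final class StemCountSketch {

    /**
     * Returns true if s contains a vowel in Porter's sense: a, e, i, o, u,
     * or a 'y' that follows a consonant. The loop returns at the first vowel,
     * so any 'y' reached at index > 0 is necessarily preceded by a consonant.
     */
    static boolean containsVowel(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if ("aeiou".indexOf(c) >= 0 || (c == 'y' && i > 0)) {
                return true;
            }
        }
        return false;
    }

    /** Lowercases the word and applies Porter steps 1a and 1c only. */
    static String stem(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        if (w.length() <= 2) {
            return w; // very short words are left unchanged
        }
        // Step 1a: sses -> ss, ies -> i, ss -> ss, s -> (nothing)
        if (w.endsWith("sses") || w.endsWith("ies")) {
            w = w.substring(0, w.length() - 2);
        } else if (!w.endsWith("ss") && w.endsWith("s")) {
            w = w.substring(0, w.length() - 1);
        }
        // Step 1c: final y -> i when the part before it contains a vowel
        if (w.endsWith("y") && containsVowel(w.substring(0, w.length() - 1))) {
            w = w.substring(0, w.length() - 1) + "i";
        }
        return w;
    }

    /** Tokenizes on non-letters (assumption: ASCII English text) and counts stems. */
    static Map<String, Integer> countStems(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        // Drop possessive 's first so "IBM's" is counted as "ibm".
        String cleaned = text.toLowerCase(Locale.ROOT).replaceAll("'s\\b", "");
        for (String token : cleaned.split("[^a-z]+")) {
            if (!token.isEmpty()) {
                counts.merge(stem(token), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String text = "What Silver Lake Looks For in IPO Candidates\n"
                + "3 Companies Crushed by Earnings: Apple, Cirrus Logic, IBM\n"
                + "IBM's Palmisano: How You Get To Be A 100-Year Old Company";
        countStems(text).forEach((stem, n) -> System.out.println(stem + "\t" + n));
    }
}
```

For real training data, one of the libraries above is a safer choice, since the full algorithm has several more suffix-stripping steps that this sketch leaves out.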
Upvotes: 0
Reputation: 1
You may also look at GATE: http://gate.ac.uk/
If you want to use the words to train a bag-of-words model, you can use TF-IDF values instead of absolute counts.
http://en.wikipedia.org/wiki/Tf%E2%80%93idf
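As a rough illustration, the sketch below computes one common TF-IDF variant: the raw count of the term in the document multiplied by ln(N / df), where N is the number of documents and df is the number of documents containing the term. Other weightings exist, and the names `TfIdfSketch` and `tfIdf` are invented for this example. The input is assumed to be already tokenized and stemmed.

```java
import java.util.List;

/** Minimal TF-IDF sketch: raw term count times ln(N / document frequency). */
final class TfIdfSketch {

    /**
     * @param term   the (already normalized) term to score
     * @param doc    the document being scored, as a list of tokens
     * @param corpus all documents, each as a list of tokens
     * @return tf * ln(N / df), or 0 if the term does not occur
     */
    static double tfIdf(String term, List<String> doc, List<List<String>> corpus) {
        long tf = doc.stream().filter(term::equals).count();
        long df = corpus.stream().filter(d -> d.contains(term)).count();
        if (tf == 0 || df == 0) {
            return 0.0;
        }
        return tf * Math.log((double) corpus.size() / df);
    }

    public static void main(String[] args) {
        List<List<String>> corpus = List.of(
                List.of("silver", "lake", "ipo", "candidate"),
                List.of("compani", "crush", "earning", "ibm"),
                List.of("ibm", "palmisano", "year", "old", "compani"));
        // "compani" appears in 2 of 3 documents; "silver" in only 1 of 3,
        // so "silver" gets the higher weight in the document that contains it.
        System.out.println(tfIdf("compani", corpus.get(1), corpus));
        System.out.println(tfIdf("silver", corpus.get(0), corpus));
    }
}
```

A term that occurs in every document scores 0 under this variant, which is the point of the weighting: words that appear everywhere carry little information about any one class.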
Upvotes: 0
Reputation: 80192
Apache Lucene and OpenNLP provide good implementations of stemming algorithms. You can review them and use whichever suits you best. I've been using Lucene for my projects.
Upvotes: 4