Reputation: 7011
What is available in terms of libraries / open-source software for processing and categorising natural language? I've got a database full of strings which are user descriptions of a particular item. I'd like to categorise these words to weed out the useless ones and make an educated guess as to what category the item fits into (e.g. Technology, Sport, Music).
I realise this is a fairly specific request, and my knowledge of natural language processing is very limited. I'm wondering what would be the best and, if possible, most computationally cheap way of making this sort of prediction?
I would prefer to do this in Ruby; however, Python or Java is also acceptable.
Upvotes: 2
Views: 465
Reputation: 14655
Check this list of natural language processing toolkits: http://en.wikipedia.org/wiki/List_of_natural_language_processing_toolkits. Some that are not mentioned there: Weka, Mallet, and Stanford Classifier.
Upvotes: 1
Reputation: 20438
So you have a bunch of text chunks that you want to classify into different categories. The problem is essentially the same as spam filtering, except that a spam filter classifies emails into only two categories whereas you have several; the same principle (Bayes' theorem) still applies. A Naive Bayes classifier is one of the simplest and least computationally demanding methods to tackle the problem. Then you can build on that knowledge and use more complicated methods, such as neural networks, to make more accurate classifications. A great book on the topic is Programming Collective Intelligence.
See also dANN, which is a Java library that has a Naive Bayes classifier implementation and many other tools for predictive analysis, and this video about Google Predict, which shows how to categorize sentences by language. The same method could be used for classifying descriptions almost verbatim.
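As a rough illustration of the Naive Bayes idea described above, here is a minimal sketch in plain Python (standard library only). It is not taken from dANN or any other library mentioned here. The stop-word list, the training sentences, and the names `tokenize` and `NaiveBayesClassifier` are all made up for the example. It assumes a multinomial model with add-one (Laplace) smoothing, which is one common way to do this, and it expects at least one training example before `classify` is called.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical stop-word list for illustration; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "the", "is", "in", "of", "for", "to", "with", "my", "this"}


def tokenize(text):
    """Lowercase the text, keep runs of letters, and drop stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


class NaiveBayesClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word -> count
        self.doc_counts = Counter()              # category -> number of training texts
        self.vocabulary = set()                  # every word seen during training

    def train(self, text, category):
        self.doc_counts[category] += 1
        for word in tokenize(text):
            self.word_counts[category][word] += 1
            self.vocabulary.add(word)

    def scores(self, text):
        """Return log P(category) + sum of log P(word | category) for each category."""
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        words = tokenize(text)
        result = {}
        for category, n_docs in self.doc_counts.items():
            total_words = sum(self.word_counts[category].values())
            score = math.log(n_docs / total_docs)
            for word in words:
                count = self.word_counts[category][word]
                # Add-one smoothing so unseen words do not zero out the probability.
                score += math.log((count + 1) / (total_words + vocab_size))
            result[category] = score
        return result

    def classify(self, text):
        scores = self.scores(text)
        return max(scores, key=scores.get)


# Made-up training data; in practice this would come from hand-labelled descriptions.
training_data = [
    ("wireless bluetooth headphones with noise cancelling", "Technology"),
    ("laptop with fast processor and large screen", "Technology"),
    ("leather football for outdoor training", "Sport"),
    ("running shoes for marathon training", "Sport"),
    ("acoustic guitar with steel strings", "Music"),
    ("vinyl record of classic jazz album", "Music"),
]

classifier = NaiveBayesClassifier()
for text, category in training_data:
    classifier.train(text, category)

print(classifier.classify("electric guitar strings"))      # Music
print(classifier.classify("shoes for football training"))  # Sport
```

Working with log probabilities avoids numeric underflow on longer descriptions, and training is just counting, which is why this approach is cheap. With only six training sentences the output is a toy result; real accuracy depends on how many labelled descriptions you have per category.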
Upvotes: 1
Reputation: 160631
One of the top linguistic resources available from almost any programming language is WordNet. It's a lexical database of English that groups words into sets of synonyms and records their parts of speech and the semantic relations between them. If you saw IBM's Watson compete on the TV show Jeopardy, you saw WordNet in action, as it was one of the technologies used.
There is a "WordNet for Ruby" gem. I haven't used it, but I've used WordNet many times. Hopefully WordNet's installation has become easier, as it was a pain in the past.
Perl has the Lingua::Wordnet module, which I have used. Also, a quick search for "Python + wordnet" returns several hits.
Upvotes: 1
Reputation: 64177
Unfortunately, Ruby doesn't have a quality NLP library; however, if you use JRuby, you can take advantage of Java's quality NLP libraries, such as:
Upvotes: 2
Reputation: 9091
As to Python, for the moment I can recommend looking into:
It has good documentation and lots and lots of functionality in the field of natural language processing. There is also a package in the Ubuntu repository (python-nltk), so it's easy to install and experiment with.
For most situations you'll need access to a good quality corpus.
Upvotes: 3