Aidanc

Reputation: 7011

Techniques for categorising natural language strings?

What is available in terms of libraries / open-source software for processing and categorising natural language? I've got a database full of strings which are user descriptions of a particular item. I'd like to categorise these words to weed out the useless ones and make an educated guess as to which category the item fits into (e.g. Technology, Sport, Music).

I realise this is a fairly specific request and my knowledge of natural language processing is very limited. I'm wondering what would be the best and, if possible, most computationally cheap way of making these sorts of predictions.

I would prefer to do this in Ruby; however, Python or Java is also acceptable.

Upvotes: 2

Views: 465

Answers (5)

yura

Reputation: 14655

Check this list of natural language processing toolkits: http://en.wikipedia.org/wiki/List_of_natural_language_processing_toolkits. Some that are not mentioned there: Weka, Mallet, and the Stanford Classifier.

Upvotes: 1

Gaslight Deceive Subvert

Reputation: 20438

So you have a bunch of text chunks that you want to classify into different categories. The problem is essentially the same as spam filtering: a spam filter classifies emails into only two categories, while you have several, but the same principle (Bayes' theorem) still applies. A Naive Bayes classifier is one of the simplest and least computationally demanding methods for tackling the problem. You can then build on that knowledge and use more complicated methods, such as neural networks, to make more accurate classifications. A great book on the topic is Programming Collective Intelligence.
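To make the Naive Bayes idea concrete, here is a minimal sketch of a multinomial Naive Bayes text classifier in plain Python (standard library only; the training data and category names are made-up examples, not from any real dataset):

```python
import math
from collections import Counter, defaultdict

def train(samples):
    """samples: list of (text, category) pairs. Returns per-category
    word frequencies and per-category sample counts (the priors)."""
    word_counts = defaultdict(Counter)   # category -> word frequencies
    cat_counts = Counter()               # category -> number of samples
    for text, cat in samples:
        cat_counts[cat] += 1
        word_counts[cat].update(text.lower().split())
    return word_counts, cat_counts

def classify(text, word_counts, cat_counts):
    """Pick the category maximising log P(cat) + sum of log P(word|cat),
    using add-one (Laplace) smoothing for unseen words."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total = sum(cat_counts.values())
    best, best_score = None, float("-inf")
    for cat in cat_counts:
        score = math.log(cat_counts[cat] / total)
        denom = sum(word_counts[cat].values()) + len(vocab)
        for word in text.lower().split():
            score += math.log((word_counts[cat][word] + 1) / denom)
        if score > best_score:
            best, best_score = cat, score
    return best

training = [
    ("guitar amp with overdrive pedal", "Music"),
    ("vintage bass guitar", "Music"),
    ("football boots size ten", "Sport"),
    ("tennis racket and balls", "Sport"),
    ("usb laptop charger", "Technology"),
    ("android phone with cracked screen", "Technology"),
]
wc, cc = train(training)
print(classify("electric guitar", wc, cc))  # -> Music
```

Training is just counting, and classification is a handful of additions per category, which is why this method is so computationally cheap.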

See also dANN, a Java library with a Naive Bayes classifier implementation and many other tools for predictive analysis, and this video about Google Predict, which shows how to categorise sentences into languages. The same method could be applied almost verbatim to classifying your descriptions.

Upvotes: 1

the Tin Man

Reputation: 160631

One of the top linguistic libraries for any programming language is WordNet. It's used to parse text, break it down, and determine parts of speech. If you saw IBM's Watson compete on the TV show Jeopardy!, you saw WordNet in action, as it was one of the technologies used.

There is a "WordNet for Ruby" gem. I haven't used it, but I've used WordNet many times. Hopefully installation has become easier, as it was a pain in the past.

Perl has the Lingua::Wordnet module, which I have used. A quick search for "Python + wordnet" also returns several hits.

Upvotes: 1

Mike Lewis

Reputation: 64177

Unfortunately, Ruby doesn't have a quality NLP library. However, if you use JRuby, you can take advantage of Java's quality NLP libraries, such as:

GATE

LingPipe

OpenNLP

Upvotes: 2

julx

Reputation: 9091

As for Python, at the moment I can recommend looking into:

http://www.nltk.org/

It has good documentation and lots and lots of functionality in the field of natural language processing. There is also a package in the Ubuntu repositories (python-nltk), so it's easy to install and experiment with.

For most situations you'll need access to a good quality corpus.
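NLTK's classifiers consume feature dictionaries rather than raw strings, so the first step for the descriptions in the question is feature extraction. A minimal, standard-library-only sketch (the stop-word list and feature naming are illustrative assumptions, not NLTK requirements):

```python
# Hypothetical stop-word list for weeding out "useless" words;
# a real one would be larger (NLTK ships stop-word corpora).
STOPWORDS = frozenset({"a", "an", "the", "with", "and", "of"})

def description_features(text):
    """Turn a raw item description into the {feature: value} dict
    that NLTK-style classifiers consume, dropping stop words."""
    words = text.lower().split()
    return {"contains(%s)" % w: True for w in words if w not in STOPWORDS}

feats = description_features("A vintage guitar with amp")
# {'contains(vintage)': True, 'contains(guitar)': True, 'contains(amp)': True}
```

With NLTK installed, a list of (description_features(text), category) pairs can then be passed to nltk.NaiveBayesClassifier.train, and new descriptions classified with the trained classifier's classify method.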

Upvotes: 3
