Ever Think

Reputation: 813

How to classify a string as belonging to a particular subject area using Java?

A short intro
I have extracted a batch of text from a set of PDF files. Each piece of text is the title of a document.

My objective is to classify the titles based on the terms that appear in them. For example, if a title contains "car", it should be classified as automobile.

Example of my objective

Imagine the following titles:

1) DISTRIBUTED MESH NETWORK
2) MONITORING A SELF-CONTAINED SERVER RACK SYSTEM
3) SIDE PANEL FOR AN AUTOMOBILE
4) LOCATION-BASED VEHICLE MESSAGING SYSTEM

The titles above should be classified as follows:

The 1st title contains the term "network", so classify it as Networking.
The 2nd title contains the term "server", so classify it as Networking.
The 3rd title contains the term "automobile", so classify it as Automobile.
The 4th title contains the term "vehicle", so classify it as Automobile.

This is what I need.

My Work

To achieve this, I created an index of terms for each category, stored in one text file per category, and matched each title against it. If a title contains a word from one of the text files, the title is classified under that category.

For example

Automobile.txt contains car, gear, wheel, clutch.
networking.txt contains server, IP Address, TCP, RIP.

This is the algorithm:

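String classify(String title)
{
 // Default for titles that match no term list (an assumed label);
 // without an initial value the method does not compile.
 String area = "Unknown";
 // Later matches overwrite earlier ones, so the last matching category wins.
 if (compareWordsFrom("Automobile.txt", title)) area = "Auto";
 if (compareWordsFrom("networking.txt", title)) area = "Networking";
 if (compareWordsFrom("metals.txt", title)) area = "Metallurgy";
 return area;
}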
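
A minimal, self-contained sketch of the matching step described above might look like the following. It is not the code I actually run: the class name `TitleClassifier`, the `"Unknown"` fallback, the extra terms `automobile`, `vehicle`, and `network`, and holding the term lists in memory instead of reading the text files are all assumptions made for illustration. Unlike the algorithm above, it picks the area with the most matching terms rather than the last one that matches.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: term lists are held in memory rather than read from the .txt files.
class TitleClassifier {
    // Area name -> index terms, kept in insertion order.
    private final Map<String, List<String>> index = new LinkedHashMap<>();

    void addCategory(String area, List<String> terms) {
        index.put(area, terms);
    }

    // Lower-cases, turns punctuation into spaces, and pads with spaces
    // so that whole-word matching can be done with contains().
    private static String normalize(String text) {
        return " " + text.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]+", " ").trim() + " ";
    }

    // Returns the area with the most matching terms, or "Unknown" if nothing matches.
    // Ties go to the area that was added first.
    String classify(String title) {
        String normalizedTitle = normalize(title);
        String best = "Unknown";
        int bestHits = 0;
        for (Map.Entry<String, List<String>> entry : index.entrySet()) {
            int hits = 0;
            for (String term : entry.getValue()) {
                if (normalizedTitle.contains(normalize(term))) {
                    hits++;
                }
            }
            if (hits > bestHits) {
                bestHits = hits;
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        TitleClassifier classifier = new TitleClassifier();
        classifier.addCategory("Auto",
                Arrays.asList("car", "gear", "wheel", "clutch", "automobile", "vehicle"));
        classifier.addCategory("Networking",
                Arrays.asList("server", "IP Address", "TCP", "RIP", "network"));
        System.out.println(classifier.classify("SIDE PANEL FOR AN AUTOMOBILE"));
        System.out.println(classifier.classify("DISTRIBUTED MESH NETWORK"));
    }
}
```

Matching on whole words keeps a term such as "car" from firing on a title that only contains "carbon".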

My Problem
My problem is that it is very difficult to find related words to build the index. The automobile field, for example, has thousands of related terms that are hard to collect.

To be precise, building the index of terms manually is a heart-breaking process.

My Need
I need an automated way to do this. Can Natural Language Processing techniques do it, or is there a ready-made library available?

Upvotes: 0

Views: 526

Answers (2)

Omoro

Reputation: 972

I think you should have a look at Lucene if you haven't already.

Upvotes: 0

anomal

Reputation: 2299

http://en.wikipedia.org/wiki/WordNet

WordNet is a lexical database for the English language. It groups English words into sets of synonyms called synsets, provides short, general definitions, and records the various semantic relations between these synonym sets. The purpose is twofold: to produce a combination of dictionary and thesaurus that is more intuitively usable, and to support automatic text analysis and artificial intelligence applications. The database and software tools have been released under a BSD style license and can be downloaded and used freely. The database can also be browsed online.

WordNet: http://wordnet.princeton.edu/
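
As a rough sketch of how such a database could help build the term index automatically: start from a few seed words per category and repeatedly pull in related words. The code below does not call WordNet or any WordNet library; the `related` map is a hypothetical stand-in for whatever lookup a real WordNet binding provides, and the names `SeedExpander`, `addRelation`, and `expand` are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: expands seed terms through a word -> related-words lookup.
class SeedExpander {
    // Stand-in for a lexical database lookup (assumption, not a WordNet API).
    private final Map<String, List<String>> related = new HashMap<>();

    void addRelation(String word, String... others) {
        related.put(word, Arrays.asList(others));
    }

    // Breadth-first expansion: follows related-word links up to maxDepth hops
    // from the seeds and returns every word reached, seeds included.
    Set<String> expand(List<String> seeds, int maxDepth) {
        Set<String> seen = new TreeSet<>(seeds);
        List<String> frontier = new ArrayList<>(seeds);
        for (int depth = 0; depth < maxDepth && !frontier.isEmpty(); depth++) {
            List<String> next = new ArrayList<>();
            for (String word : frontier) {
                for (String candidate : related.getOrDefault(word, Collections.emptyList())) {
                    // Set.add returns false for words already collected.
                    if (seen.add(candidate)) {
                        next.add(candidate);
                    }
                }
            }
            frontier = next;
        }
        return seen;
    }

    public static void main(String[] args) {
        SeedExpander expander = new SeedExpander();
        expander.addRelation("automobile", "car", "vehicle");
        expander.addRelation("car", "wheel");
        System.out.println(expander.expand(Arrays.asList("automobile"), 2));
    }
}
```

Keeping the depth limit small matters in practice, since following related words too far tends to drift away from the original category.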

Upvotes: 1
