Reputation: 813
A short intro
I have extracted a batch of text from a set of PDF files. The text consists of the document titles.
My objective is to classify the titles based on the terms that appear in them. For example, if a title contains the word "car", it should be classified as automobile.
Example for my objective
Imagine the following titles:
1) DISTRIBUTED MESH NETWORK
2) MONITORING A SELF-CONTAINED SERVER RACK SYSTEM
3) SIDE PANEL FOR AN AUTOMOBILE
4) LOCATION-BASED VEHICLE MESSAGING SYSTEM
The titles above should be classified as follows:
The 1st title contains the term "network", so classify it as Networking
The 2nd title contains the term "server", so classify it as Networking
The 3rd title contains the term "automobile", so classify it as Automobile
The 4th title contains the term "vehicle", so classify it as Automobile
This is what I need.
My Work
To achieve my objective, I created an index of terms in a text file for each category and matched each title against it. If the title contains a word from one of the text files, the title is classified under that file's category.
For example:
Automobile.txt
contains car, gear, wheel, clutch
networking.txt
contains server, IP Address, TCP, RIP
This is the algorithm:
String classify(String title)
{
    // Stays null when no category index matches the title.
    String area = null;
    // Each check overwrites the previous result, so the last matching category wins.
    if (compareWordsFrom("Automobile.txt", title)) area = "Auto";
    if (compareWordsFrom("networking.txt", title)) area = "Networking";
    if (compareWordsFrom("metels.txt", title)) area = "Metallurgy";
    return area;
}
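For readers who want something runnable, here is a minimal sketch of the matching step. It is not the original compareWordsFrom, whose implementation is not shown. It assumes case-insensitive, whole-word matching and keeps the term lists in a hypothetical in-memory map (INDEX) instead of the text files. The terms "automobile", "vehicle" and "network" are added here only so that the example titles match. Unlike the algorithm above, it returns the first matching category.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class TitleClassifier {
    // Hypothetical in-memory stand-in for the per-category term files.
    static final Map<String, List<String>> INDEX = new LinkedHashMap<>();
    static {
        INDEX.put("Auto", List.of("car", "gear", "wheel", "clutch", "automobile", "vehicle"));
        INDEX.put("Networking", List.of("server", "ip address", "tcp", "rip", "network"));
    }

    // Lowercase the title and collapse punctuation to single spaces,
    // padding with spaces so that terms match only on word boundaries.
    static String normalize(String title) {
        String cleaned = title.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]+", " ").trim();
        return " " + cleaned + " ";
    }

    // Returns the first category with a matching term, or null if none matches.
    static String classify(String title) {
        String padded = normalize(title);
        for (Map.Entry<String, List<String>> entry : INDEX.entrySet()) {
            for (String term : entry.getValue()) {
                if (padded.contains(" " + term + " ")) {
                    return entry.getKey();
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(classify("DISTRIBUTED MESH NETWORK"));
        System.out.println(classify("SIDE PANEL FOR AN AUTOMOBILE"));
    }
}
```

Padding with spaces keeps "car" from matching inside a word such as "CARPET", and it lets multi-word terms like "IP Address" match as a phrase.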
My Problem
My problem is that it is very difficult to find all the related words needed to build the index. The automobile field alone has thousands of related terms, which are hard to collect.
To be precise, building the index of terms manually is a heart-breaking process.
My Need
I need an automated way to do this. Can Natural Language Processing techniques do it? Or is there a ready-made library available?
Upvotes: 0
Views: 526
Reputation: 2299
http://en.wikipedia.org/wiki/WordNet
WordNet is a lexical database for the English language. It groups English words into sets of synonyms called synsets, provides short, general definitions, and records the various semantic relations between these synonym sets. The purpose is twofold: to produce a combination of dictionary and thesaurus that is more intuitively usable, and to support automatic text analysis and artificial intelligence applications. The database and software tools have been released under a BSD style license and can be downloaded and used freely. The database can also be browsed online.
WordNet: http://wordnet.princeton.edu/
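One way a synonym database could help automate the index is to start from a few seed terms per category and follow synonym links outwards. The sketch below only illustrates that idea. SynonymSource is a hypothetical interface, not part of any WordNet library, and the sample map is hand-written stand-in data rather than the result of a real WordNet query. A real version would implement synonymsOf using whichever Java WordNet binding is chosen.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class IndexExpander {
    // Hypothetical lookup interface; a real implementation would query WordNet.
    interface SynonymSource {
        List<String> synonymsOf(String word);
    }

    // Breadth-first expansion of the seed terms, following synonym links
    // for at most maxHops steps. Seeds are always included in the result.
    static Set<String> expand(List<String> seeds, SynonymSource source, int maxHops) {
        Set<String> seen = new LinkedHashSet<>(seeds);
        Deque<String> frontier = new ArrayDeque<>(seeds);
        for (int hop = 0; hop < maxHops && !frontier.isEmpty(); hop++) {
            Deque<String> next = new ArrayDeque<>();
            for (String word : frontier) {
                for (String synonym : source.synonymsOf(word)) {
                    // Set.add returns false for terms already collected.
                    if (seen.add(synonym)) {
                        next.add(synonym);
                    }
                }
            }
            frontier = next;
        }
        return seen;
    }

    public static void main(String[] args) {
        // Hand-written sample data standing in for a synonym database.
        Map<String, List<String>> sample = Map.of(
            "car", List.of("auto", "automobile", "motorcar"),
            "automobile", List.of("car", "motorcar"));
        SynonymSource stub = word -> sample.getOrDefault(word, List.of());
        System.out.println(expand(List.of("car"), stub, 2));
    }
}
```

Limiting the number of hops matters in practice, because following synonym links too far tends to pull in terms from unrelated senses of a word.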
Upvotes: 1