Reputation: 35
My problem is to filter out all the names of persons in a table, so that only the names of companies, schools, and institutions are left in the database.
I tried a simple solution: I was given a list of names of companies, schools, etc., and I searched it for the most common terms. (Note: I did not search for every common substring within a name, since that would be too expensive.) I assigned weights to those terms, and also to the most common substrings. With that, if a string contains "corp", "inc", "school", or "univ", then it is very likely not the name of a person.
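The weighted-term idea described above can be sketched roughly as follows. This is only an illustration: the weights, the threshold, and the function names are made up, and the sketch matches whole words rather than substrings (an assumption, chosen so that "inc" does not fire on a name like "Vincent").

```python
import re

# Hypothetical weights. In practice they would be derived from how often
# each term occurs in the known list of organization names.
ORG_TERM_WEIGHTS = {
    "corp": 0.9,
    "corporation": 0.9,
    "inc": 0.9,
    "school": 0.8,
    "univ": 0.8,
    "university": 0.8,
}


def org_score(name):
    """Sum the weights of all known organization terms found in the name."""
    tokens = re.findall(r"[a-z]+", name.lower())
    return sum(ORG_TERM_WEIGHTS.get(token, 0.0) for token in tokens)


def looks_like_organization(name, threshold=0.5):
    """Return True if the score reaches the (arbitrary) threshold."""
    return org_score(name) >= threshold
```

Names that score below the threshold are not necessarily persons; they are simply names that the term list has no evidence about.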
Now, my problem is how to turn this into an AI. Moreover, I need to make finer classification easy, e.g. companies only, schools only, etc.
For example
XYZ Brewery Corporation -> company
Harvard University -> school
Department of Health -> government agency
The only AI techniques I know are Naive Bayes, K-Means, hierarchical clustering, FCM, and ANN. Those techniques commonly take numerical values, so I don't know how to apply them here. The only techniques I know that handle strings extensively are Levenshtein, stemming, Needleman-Wunsch, and Jaro-Winkler.
Is my first approach incorrect? How can I incorporate the techniques that I know? Do I have to learn a new technique? I'm basically new to AI since I am still a student. However, this is not an assignment; it's for a company project (actually I am the only computer science major in our group, so a lot of it falls on me). By the way, if you are curious about what language I use, I am using C#, since I am planning to make it a stand-alone application and the users are on Windows.
Upvotes: 3
Views: 3544
Reputation: 3249
The Python library probablepeople uses a conditional random field model to do this. (I'm a contributor to this project.)
Upvotes: 0
Reputation: 9256
Don't just jump into fancy machine learning algorithms. Your common sense and intuition can get you quite far.
Your idea of having large lists of entities is pretty good, and might work out very well for schools, if you can find a list of all the post-secondary institutions in the world. If you can compile such a list, it's unlikely it will contain every university in the world, but it will probably be good enough for all practical purposes.
From the lists you have already compiled, you can count the number of times every unigram (i.e. word) and bigram (i.e. consecutive pair of words) occurs for each class of entities and see which phrases strongly tend towards a particular class (e.g. 'department of' might mostly occur for government agencies; 'inc', 'ltd', '& co.' might occur only for companies; 'university', 'school', 'college' might occur mostly for schools). You can formalize these ideas into a Naive Bayes model, but a simpler rule that just checks for certain phrases in a large if-then statement might get you 90% of the way there.
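As a rough illustration of the counting idea, here is a minimal Naive Bayes sketch over unigram and bigram counts with add-one smoothing. The class name, the helper `features`, and the tiny training list are all made up for the example; a real model would be trained on the lists you have already compiled.

```python
import math
import re
from collections import Counter, defaultdict


def features(name):
    """Lowercased unigrams plus bigrams of consecutive words."""
    words = re.findall(r"[a-z&]+", name.lower())
    bigrams = [" ".join(pair) for pair in zip(words, words[1:])]
    return words + bigrams


class NaiveBayesNameClassifier:
    def __init__(self):
        self.class_counts = Counter()                # names seen per class
        self.feature_counts = defaultdict(Counter)   # feature counts per class
        self.vocabulary = set()                      # all features seen

    def train(self, labeled_names):
        for name, label in labeled_names:
            self.class_counts[label] += 1
            for feature in features(name):
                self.feature_counts[label][feature] += 1
                self.vocabulary.add(feature)

    def classify(self, name):
        total_names = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            # Log prior plus smoothed log likelihood of each feature.
            score = math.log(count / total_names)
            denom = sum(self.feature_counts[label].values()) + vocab_size
            for feature in features(name):
                score += math.log((self.feature_counts[label][feature] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy, invented training data; far too small for real use.
TRAINING = [
    ("XYZ Brewery Corporation", "company"),
    ("Acme Trading Inc", "company"),
    ("Harvard University", "school"),
    ("Springfield High School", "school"),
    ("Department of Health", "government agency"),
    ("Department of Education", "government agency"),
    ("John Smith", "person"),
    ("Maria Santos", "person"),
]

clf = NaiveBayesNameClassifier()
clf.train(TRAINING)
```

With this toy data, `clf.classify("Department of Transportation")` picks the government agency class because 'department', 'of', and 'department of' were all seen there. With so few examples, names made entirely of unseen words are decided by smoothing alone, so treat those results as noise.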
Upvotes: 3
Reputation: 28846
This problem is generally called Named Entity Recognition (NER). The SharpNLP project is a C# library of NLP algorithms, including NER. It seems to be completely undocumented, though it's a C# port of Apache's OpenNLP, which has documentation on name finding; SharpNLP's interface is presumably similar.
Upvotes: 4
Reputation: 7738
You might want to take a look at the Febrl project.
Febrl (Freely Extensible Biomedical Record Linkage) does data standardisation (segmentation and cleaning) and probabilistic record linkage ("fuzzy" matching) of one or more files or data sources which do not share a unique record key or identifier.
In particular, take a look at the file named biomed2002hmm.pdf in the doc archive. It discusses the use of lexical tokenization and Hidden Markov Models to identify patterns for names and addresses.
The ideas presented could be applied to your problem of identifying personal versus business names. The project includes code examples (in Python though, not C#) of the techniques described.
Upvotes: 0