Reputation: 61
I need to extract the names of institutes from the given data. Institute names look similar to each other (Anna University, Mashsa Institute of Technology, Banglore School of Engineering, Model Engineering College), and there will be a lot of similar data. How can I create a model to extract these names from text? (I need to extract them from resumes/CVs.)
I tried adding a new NER entity type in spaCy, but even after training the loss doesn't decrease and the predictions are wrong. That is why I want to build a new model just for this.
Upvotes: 1
Views: 756
Reputation: 4900
Specialized text search and text analysis tools solve the problem you describe, using phonetic analysis and indexes.
Elasticsearch is one popular text analysis tool: you index your documents and search them through its REST API.
Google also provides tools for text analysis and indexing.
Modern RDBMSs such as Oracle and PostgreSQL provide similar features.
Good luck.
Upvotes: 0
Reputation: 506
You are doing text parsing.
I know you want to build a model for that, but you can't do it without labeled target data (example texts together with the lists of school names they contain), which I don't think you have. I suggest you write the extraction rules yourself, without a trained model.
Your best bet is regular expressions.
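import re

# mystring is assumed to hold the text to search (for example, the contents of a resume).
# Each sub-pattern matches one naming convention.
sub_patterns = ['[A-Z][a-z]* University',
                'University of [A-Z][a-z]*',
                'Ecole [A-Z][a-z]*']
# Join the sub-patterns into a single alternation inside one capturing group.
pattern = '({})'.format('|'.join(sub_patterns))
matches = re.findall(pattern, mystring)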
I used the text from this site and I get:
matches
['University of Cambridge',
'University of Oxford',
'Harvard University',
'Columbia University',
'Princeton University',
'University of Chicago',
'Stanford University',
'Yale University',
'University of California',
'Humboldt University',
'Cornell University',
'University of Pennsylvania',
'University of London',
'Uppsala University',
'University of Edinburgh',
'Heidelberg University',
'University of California',
'York University',
'University of Michigan',
'Hopkins University',
'University of Vienna',
'University of G',
'State University',
'University of Bologna',
'Leipzig University',
'Maximilian University',
'University of Southern',
'University of Tokyo',
'Leiden University',
'Lund University',
'Charles University',
'University of Copenhagen',
'Ecole Normale',
'University of Manchester',
'Ecole Polytechnique',
'University of Bonn',
'University of Texas',
'Duke University',
'Mellon University',
'Azhar University',
'University of Helsinki',
'University of Virginia',
'Hebrew University',
'University of Toronto',
'University of Illinois',
'Sapienza University',
'University of Zurich',
'University of Washington',
'University of Minnesota',
'Georgetown University',
'University of Wisconsin',
'Gill University',
'University of Glasgow',
'University of Oslo',
'Peking University',
'State University',
'Brown University',
'University of T',
'Jagiellonian University',
'State University',
'Free University',
'Kyoto University',
'University of Padua',
'Waseda University',
'University of Florida',
'University of Geneva',
'State University',
'University of Jena',
'Keio University',
'University of Arizona',
'University of Maryland',
'Stockholm University',
'Boston University',
'University of Strasbourg',
'University of Tartu',
'Rutgers University',
'University of Warsaw',
'Utrecht University',
'University of North',
'Rockefeller University',
'Luther University',
'Tsinghua University',
'University of St',
'University of Amsterdam',
'Northwestern University',
'University of Notre',
'Technical University',
'University of Coimbra',
'Indiana University']
As you can see, the Massachusetts Institute of Technology doesn't appear, we get Ecole Normale instead of Ecole Normale Superieure, University of G instead of University of Göttingen (because ö is not in [a-z]), and there are other mistakes.
These are perfectly normal since the patterns I wrote are not good enough yet. It's your job to build good patterns for your data now.
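As a starting point, here is a rough sketch of how the patterns could be widened to cover multi-word names, "Institute of ..." style names, and accented letters. The keyword list and the `find_institutes` helper are my own assumptions for illustration, not a tested solution for real resumes, so expect to tune them against your data.

```python
import re

# One capitalized word. In Python 3, \w also matches accented letters such as "ö".
WORD = r"[A-Z]\w*"
# One or more capitalized words separated by single spaces.
NAME = rf"{WORD}(?: {WORD})*"
# Hypothetical keyword list; extend it with the terms that occur in your data.
KEYWORD = r"(?:University|Institute|College|School)"

sub_patterns = [
    # "Anna University", "Model Engineering College",
    # "Massachusetts Institute of Technology"
    rf"\b{NAME} {KEYWORD}(?: of {NAME})?",
    # "University of Göttingen"
    rf"\b{KEYWORD} of {NAME}",
    # "Ecole Normale Superieure"
    rf"\bEcole {NAME}",
]
# Only non-capturing groups are used, so findall returns the full matches.
PATTERN = re.compile("|".join(sub_patterns))


def find_institutes(text):
    """Return every substring of text that matches one of the sub-patterns."""
    return PATTERN.findall(text)


sample = (
    "She studied at Anna University and then at the Massachusetts "
    "Institute of Technology. Later she visited the University of "
    "Göttingen, Model Engineering College and Ecole Normale Superieure."
)
print(find_institutes(sample))
```

This still has known weaknesses: a capitalized word right before a name (for example "At Anna University" at the start of a sentence) is swallowed into the match, two names listed without punctuation between them are merged, and names starting with a non-ASCII capital such as "École" are missed.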
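You will also likely need text preprocessing to make this task easier, such as converting your text to ASCII characters. If you also lowercase the text, the patterns can no longer rely on capital letters and must be rewritten accordingly.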
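A minimal sketch of that kind of preprocessing, using only the standard library. The `to_ascii` helper is a name I made up for illustration; it decomposes accented characters and drops the accent marks, with lowercasing as an option.

```python
import unicodedata


def to_ascii(text, lowercase=False):
    """Strip accents by decomposing characters and dropping non-ASCII parts."""
    # NFKD splits "ö" into "o" plus a combining mark; the mark is then dropped.
    decomposed = unicodedata.normalize("NFKD", text)
    ascii_text = decomposed.encode("ascii", "ignore").decode("ascii")
    return ascii_text.lower() if lowercase else ascii_text


print(to_ascii("University of Göttingen"))
print(to_ascii("École Normale Supérieure", lowercase=True))
```

Be aware that characters without an ASCII decomposition (for example "ø" or "ß") are removed entirely by this approach, so check a sample of your data before relying on it.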
Upvotes: 1