Reputation: 359
I am using spaCy's named entity recognition to extract the name, organization, etc. from a resume. Here is my Python code.
import spacy
import PyPDF2

# Read the first page of the resume PDF and extract its text
mypdf = open('C:\\Users\\akjain\\Downloads\\Resume\\Al Mal Capital_Nader El Boustany_BD Manager.pdf', mode='rb')
pdf_document = PyPDF2.PdfFileReader(mypdf)
first_page = pdf_document.getPage(0)
text = first_page.extractText()

# Run spaCy's small English pipeline and print the entities it finds
nlp = spacy.load('en_core_web_sm')
doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.label_)
The output does not look great. The name is not correctly identified: the last name is tagged as an ORG, Dubai is treated as a PERSON, and so on.
Here is a snapshot of the resume, which I got from a public dataset.
I want to extract the candidate name, organization, location, etc. from a set of resumes. The spaCy documentation says accuracy is above 95%, but that is not what I am seeing. Is there any way to improve the accuracy of entity extraction?
Upvotes: 1
Views: 2587
Reputation: 1
Try the en_core_web_trf model; it has the highest NER accuracy of all the spaCy base models. You can further fine-tune it for the best fit for your use case.
Another alternative is to train your own model. You can follow the spaCy course at the link below: https://course.spacy.io/en/chapter4
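For instance, the transformer model is a drop-in replacement for the small model (a minimal sketch, assuming spacy-transformers and en_core_web_trf are installed; the text line here is only illustrative, in your case it comes from the extracted PDF page):

import spacy

# Requires: pip install spacy-transformers
#           python -m spacy download en_core_web_trf
nlp = spacy.load("en_core_web_trf")

# Text extracted from the resume PDF (illustrative line shown here)
text = "Nader El Boustany, BD Manager at Al Mal Capital, Dubai"
doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.label_)

Note that the transformer pipeline is noticeably slower and heavier than the sm/md/lg models, so it is a trade-off between accuracy and speed.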
Upvotes: 0
Reputation: 3096
The spaCy NER model is trained on the OntoNotes corpus, which is a collection of telephone conversations, newswire, newsgroups, broadcast news, broadcast conversation, and weblogs. These types of texts mostly contain full sentences, which is quite different from the resumes you're running the model on. For instance, the entity "Dubai" has no grammatical context surrounding it, making it very difficult for this particular model to recognize it as a location. It is used to seeing sentences like "... while he was traveling in Dubai, ...". In general, Machine Learning performance is always bound to the specific problem domain you're training and evaluating your models on.
You could try running this with en_core_web_md or en_core_web_lg, which perform slightly better on OntoNotes, but they will still not do well on your specific domain texts.
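A quick way to check is to run the same resume text through each pretrained pipeline and compare the entities (a rough sketch; it assumes the md and lg models have been downloaded with python -m spacy download, and the text line is only illustrative):

import spacy

# Compare how each pretrained English pipeline tags the same resume text
text = "Nader El Boustany, BD Manager at Al Mal Capital, Dubai"
for model_name in ("en_core_web_sm", "en_core_web_md", "en_core_web_lg"):
    nlp = spacy.load(model_name)
    doc = nlp(text)
    print(model_name, [(ent.text, ent.label_) for ent in doc.ents])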
To try and improve the accuracy, I would recommend refining the existing model by annotating a set of resumes yourself and feeding that training data back into the model. See the documentation here. I'm not certain how well this will work, however, because as I said, resumes are harder since they provide less sentence-level context.
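Roughly, the update loop could look like this (a minimal sketch using the spaCy 3.x training API; the example sentence, entity offsets, and output path are purely illustrative, and in practice you would annotate many more resumes):

import random
import spacy
from spacy.training import Example

# Hand-annotated resume snippets; each entity is (start_char, end_char, label)
TRAIN_DATA = [
    ("Nader El Boustany, BD Manager at Al Mal Capital, Dubai",
     {"entities": [(0, 17, "PERSON"), (33, 47, "ORG"), (49, 54, "GPE")]}),
]

nlp = spacy.load("en_core_web_sm")

# Update only the NER component; keep the rest of the pipeline frozen
other_pipes = [p for p in nlp.pipe_names if p != "ner"]
with nlp.select_pipes(disable=other_pipes):
    optimizer = nlp.resume_training()
    for itn in range(20):
        random.shuffle(TRAIN_DATA)
        losses = {}
        for text, annotations in TRAIN_DATA:
            example = Example.from_dict(nlp.make_doc(text), annotations)
            nlp.update([example], sgd=optimizer, drop=0.3, losses=losses)
        print(itn, losses)

nlp.to_disk("resume_ner_model")

Keep in mind that updating on a small, narrow dataset can cause the model to "forget" the original entity types, so it helps to mix in some generic annotated sentences as well.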
Upvotes: 6