Reputation: 359
I am using spaCy's named entity recognition to extract the name, organization, etc. from a resume. Here is my Python code.
import spacy
import PyPDF2

# Read the first page of the resume PDF and extract its text
mypdf = open('C:\\Users\\akjain\\Downloads\\Resume\\Al Mal Capital_Nader El Boustany_BD Manager.pdf', mode='rb')
pdf_document = PyPDF2.PdfFileReader(mypdf)
first_page = pdf_document.getPage(0)
text = first_page.extractText()

# Run spaCy's small English pipeline and print the entities it finds
nlp = spacy.load('en_core_web_sm')
doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.label_)
The output does not look great. The name is not correctly identified: the last name is tagged as an ORG, Dubai is treated as a PERSON, and so on.
Here is a snapshot of the resume, which I got from a public dataset.
I want to extract the candidate name, organization, location, etc. from a set of resumes. The spaCy documentation says accuracy is above 95%, but that is not what I am seeing. Is there any way to improve the accuracy of entity extraction?
Upvotes: 1
Views: 2587
Reputation: 1
Try the en_core_web_trf model; it has the highest NER accuracy of all the spaCy base models. You can further fine-tune it for the best fit for your use case.
Another alternative is to train your own model. You can follow the spaCy course at the link below: https://course.spacy.io/en/chapter4
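For instance, the transformer model is a drop-in replacement for the small model (a minimal sketch, assuming spacy-transformers and en_core_web_trf are installed; the text line here is only illustrative, in your case it comes from the extracted PDF page):

import spacy

# Requires: pip install spacy-transformers
#           python -m spacy download en_core_web_trf
nlp = spacy.load("en_core_web_trf")

# Text extracted from the resume PDF (illustrative line shown here)
text = "Nader El Boustany, BD Manager at Al Mal Capital, Dubai"
doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.label_)

Note that the transformer pipeline is noticeably slower and heavier than the sm/md/lg models, so it is a trade-off between accuracy and speed.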
Upvotes: 0
Reputation: 3096
The spaCy NER model is trained on the OntoNotes corpus, which is a collection of telephone conversations, newswire, newsgroups, broadcast news, broadcast conversation, and weblogs. These types of texts mostly contain full sentences, which is quite different from the resumes you're running the model on. For instance, the entity "Dubai" has no grammatical context surrounding it, making it very difficult for this particular model to recognize it as a location. It is used to seeing sentences like "... while he was traveling in Dubai, ...". In general, Machine Learning performance is always bound to the specific problem domain you're training and evaluating your models on.
You could try running this with en_core_web_md or en_core_web_lg, which perform slightly better on OntoNotes, but they will still not do well on your specific domain texts.
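A quick way to check is to run the same resume text through each pretrained pipeline and compare the entities (a rough sketch; it assumes the md and lg models have been downloaded with python -m spacy download, and the text line is only illustrative):

import spacy

# Compare how each pretrained English pipeline tags the same resume text
text = "Nader El Boustany, BD Manager at Al Mal Capital, Dubai"
for model_name in ("en_core_web_sm", "en_core_web_md", "en_core_web_lg"):
    nlp = spacy.load(model_name)
    doc = nlp(text)
    print(model_name, [(ent.text, ent.label_) for ent in doc.ents])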
To try and improve the accuracy, I would recommend refining the existing model by annotating a set of resumes yourself and feeding that training data back into the model. See the documentation here. I'm not certain how well this will work, however, because as I said, resumes are harder since they provide less sentence-level context.
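Roughly, the update loop could look like this (a minimal sketch using the spaCy 3.x training API; the example sentence, entity offsets, and output path are purely illustrative, and in practice you would annotate many more resumes):

import random
import spacy
from spacy.training import Example

# Hand-annotated resume snippets; each entity is (start_char, end_char, label)
TRAIN_DATA = [
    ("Nader El Boustany, BD Manager at Al Mal Capital, Dubai",
     {"entities": [(0, 17, "PERSON"), (33, 47, "ORG"), (49, 54, "GPE")]}),
]

nlp = spacy.load("en_core_web_sm")

# Update only the NER component; keep the rest of the pipeline frozen
other_pipes = [p for p in nlp.pipe_names if p != "ner"]
with nlp.select_pipes(disable=other_pipes):
    optimizer = nlp.resume_training()
    for itn in range(20):
        random.shuffle(TRAIN_DATA)
        losses = {}
        for text, annotations in TRAIN_DATA:
            example = Example.from_dict(nlp.make_doc(text), annotations)
            nlp.update([example], sgd=optimizer, drop=0.3, losses=losses)
        print(itn, losses)

nlp.to_disk("resume_ner_model")

Keep in mind that updating on a small, narrow dataset can cause the model to "forget" the original entity types, so it helps to mix in some generic annotated sentences as well.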
Upvotes: 6