Reputation: 3885
I want to build a spaCy model that recognises organisation names. Each organisation name has between 1 and 4 words, which can be title-cased or fully capitalised. I have added more than 3500 organisation names like this:
patterns = []
for organisation in organisations_list:
    patterns.append({"label": "ORG", "pattern": organisation.strip()})
So now I have a list of patterns that looks like this:
for p in patterns:
    print(p)
result:
{'label': 'ORG', 'pattern': 'BLS AG'}
{'label': 'ORG', 'pattern': 'Chemins de fer du Jura'}
{'label': 'ORG', 'pattern': 'Comlux'}
{'label': 'ORG', 'pattern': 'CRH Gétaz Group'}
{'label': 'ORG', 'pattern': 'DKSH Management AG'}
{'label': 'ORG', 'pattern': 'Ferdinand Steck Maschinenfabrik'}
{'label': 'ORG', 'pattern': 'Galenica'}
{'label': 'ORG', 'pattern': 'Givaudan'}
{'label': 'ORG', 'pattern': 'Heliswiss'}
{'label': 'ORG', 'pattern': 'Jet Aviation'}
{'label': 'ORG', 'pattern': 'Kolmar'}
...
...
So the patterns object looks like this:
patterns = [
    {'label': 'ORG', 'pattern': 'BLS AG'},
    {'label': 'ORG', 'pattern': 'Chemins de fer du Jura'},
    {'label': 'ORG', 'pattern': 'Comlux'},
    {'label': 'ORG', 'pattern': 'CRH Gétaz Group'},
    {'label': 'ORG', 'pattern': 'DKSH Management AG'},
    {'label': 'ORG', 'pattern': 'Ferdinand Steck Maschinenfabrik'},
    {'label': 'ORG', 'pattern': 'Galenica'},
    {'label': 'ORG', 'pattern': 'Givaudan'},
    {'label': 'ORG', 'pattern': 'Heliswiss'},
    {'label': 'ORG', 'pattern': 'Jet Aviation'},
    {'label': 'ORG', 'pattern': 'Kolmar'},
    # ....
]
Then I created a blank model:
nlp = spacy.blank("en")
nlp.add_pipe('entity_ruler')
ruler.add_patterns(patterns)
And then, I have tested it like this:
for full_text in list_of_texts:
    doc = nlp(full_text)
    # doc.ents is a tuple of spans, so read text and label from each entity
    print([(ent.text, ent.label_) for ent in doc.ents])
And it does not recognise anything (even when I test it on a sentence that contains the exact name of an organisation). I have also tried adding tagger and parser to my blank model with entity_ruler, but the result is always the same.
These are some of the example texts that I have used for testing (each company name in the test texts is also in the patterns, with the same capitalisation and spelling):
t1 = "I work in company called DKSH Management AG its very good company"
t2 = "I have stayed in Holiday Inn Express and I really liked it"
t3 = "Have you head for company named AKKA Technologies SE"
t4 = "what do you think about ERYTECH Pharma"
t5 = "did you get an email from ESI Group"
t6 = "Esso S.A.F. sent me an email last week"
What am I doing wrong? I have noticed that it works if I do it like this:
ruler = EntityRuler(nlp)
ruler.add_patterns(patterns)
nlp = spacy.load("en_core_web_trf")
nlp.add_pipe('entity_ruler', before = 'tagger')
# if I do print(nlp.pipeline) I can see entity_ruler added before tagger.
But then I do not know whether it works because of my entity_ruler or because of the pre-trained model. I have tested it on 20 example texts and it gives me the same results with entity_ruler and without it, so I can't tell whether it works better or not.
What am I doing wrong?
Upvotes: 3
Views: 461
Reputation: 15593
You're not adding the EntityRuler correctly. You're creating an EntityRuler from scratch and adding rules to it, and then telling the pipeline to create an EntityRuler that's completely unrelated.
This is the problem code:
ruler = EntityRuler(nlp) # ruler 1
ruler.add_patterns(patterns) # ruler 1
nlp = spacy.blank("en")
nlp.add_pipe('entity_ruler') # this creates an unrelated ruler 2
This is what you should do:
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns(patterns)
That should work.
In spaCy v2 the flow for creating a pipeline component was to create the object and then add it to the pipeline, but in v3 the flow is to ask the pipeline to create the component and then use the returned object.
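The difference between the two flows can be checked directly. The following is a minimal sketch, assuming spaCy v3 is installed; the names `detached_ruler` and `pipeline_ruler` are only illustrative. It builds one ruler by hand and lets the pipeline create another, then shows that only the second one is part of `nlp`.

```python
import spacy
from spacy.pipeline import EntityRuler

patterns = [{"label": "ORG", "pattern": "DKSH Management AG"}]
text = "I work in company called DKSH Management AG its very good company"

nlp = spacy.blank("en")

# v2-style habit: this ruler is built by hand and never joins the pipeline.
detached_ruler = EntityRuler(nlp)
detached_ruler.add_patterns(patterns)

# v3 flow: the pipeline creates the component and hands it back.
pipeline_ruler = nlp.add_pipe("entity_ruler")

# The returned object is the one stored in the pipeline; the other is not.
same_object = pipeline_ruler is nlp.get_pipe("entity_ruler")
pattern_counts = (len(detached_ruler.patterns), len(pipeline_ruler.patterns))
print(same_object, pattern_counts)

# Patterns only take effect once they are added to the pipeline's ruler.
pipeline_ruler.add_patterns(patterns)
ents = [(ent.text, ent.label_) for ent in nlp(text).ents]
print(ents)
```

If the pattern count of the ruler inside the pipeline is zero, the patterns went to a different object, which is the situation described above.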
Based on your updated examples, here is example code using the EntityRuler to match the first sentence.
import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
patterns = [
    {"label": "ORG", "pattern": "DKSH Management AG"},
    {"label": "ORG", "pattern": "Some other company"},
]
ruler.add_patterns(patterns)
doc = nlp("I work in company called DKSH Management AG its very good company")
print([(ent.text, ent.label_) for ent in doc.ents])
# output: [('DKSH Management AG', 'ORG')]
Does that clarify how you should structure your code?
Looking at your updated question code, your code with the blank model is almost right, but note that add_pipe returns the EntityRuler object. You should add your patterns to that object.
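On the question of whether a match came from the ruler or from a pretrained model, one option is to give the patterns an "id" and inspect `ent_id_` on each entity. This is a sketch under assumptions: it uses spaCy v3 and a blank pipeline so it runs without downloading a model, and the id value "ruler-org" is a made-up marker. In a pipeline that also contains a statistical `ner` component, entities predicted by that component would normally have an empty `ent_id_`, but it is worth verifying that on your own setup.

```python
import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")

# "ruler-org" is a hypothetical marker; any string works as an id.
ruler.add_patterns([
    {"label": "ORG", "pattern": "DKSH Management AG", "id": "ruler-org"},
    {"label": "ORG", "pattern": "ESI Group", "id": "ruler-org"},
])

doc = nlp("did you get an email from ESI Group")

# Entities produced by the ruler carry the id from the matching pattern.
matches = [(ent.text, ent.label_, ent.ent_id_) for ent in doc.ents]
print(matches)
```

When combining the ruler with a pretrained pipeline, where the ruler sits relative to `ner` (for example `nlp.add_pipe("entity_ruler", before="ner")`) decides which component gets to claim overlapping spans first.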
Upvotes: 2