Named Entity Recognition upper case issue

Question

I recently switched the model I use for NER in spaCy from en_core_web_md to xx_ent_wiki_sm.

I noticed that the new model always recognises fully uppercase words such as NEW JERSEY or NEW YORK as organisations. I could provide training data to retrain the model, although that would be very time-consuming. However, I am uncertain whether the model would lose the assumption that uppercase words are organisations, or whether it would keep the assumption and learn some exceptions to it. Might it even learn that every all-uppercase word with fewer than 5 letters is likely to be an organisation, and everything longer is not? I just don't know exactly how the training will affect the model.

en_core_web_md seems to handle acronyms fine, while ignoring words like NEW JERSEY. However, the overall performance of xx_ent_wiki_sm is better for my use case.

I ask because the assumption as such is still pretty useful, as it allows us to identify acronyms such as IBM as an organisation.

Ines Montani · Accepted Answer

The xx_ent_wiki_sm model was trained on Wikipedia, so it's very biased towards what Wikipedia considers an entity, and what's common in the data. (It also tends to frequently recognise "I" as an entity, since sentences in the first person are so rare on Wikipedia.) So post-training with more examples is definitely a good strategy, and what you're trying to do sounds feasible.

The best way to prevent the model from "forgetting" about the uppercase entities is to always include examples of entities that the model previously recognised correctly in the training data (see: the "catastrophic forgetting problem"). The nice thing is that you can create those programmatically by running spaCy over a bunch of text and extracting uppercase entities:

uppercase_ents = [ent for ent in doc.ents if all(t.is_upper for t in ent)]
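As a rough sketch of how that one-liner could feed into training data: the helper below (the name `uppercase_examples` is made up for illustration) takes an already processed `Doc` and, if it contains at least one all-uppercase entity, returns the text with all of its predicted entities as character offsets in the `(text, {"entities": [...]})` tuple format. It keeps the other entities in the text too, on the assumption that you want the annotation for each example to be complete. It only reads `doc.text`, `doc.ents`, `ent.start_char`, `ent.end_char`, `ent.label_` and `token.is_upper`, and it copies the model's mistakes as faithfully as its correct predictions, so the output still needs a manual review.

```python
def uppercase_examples(doc):
    """Turn a processed spaCy Doc into a (text, annotations) example.

    Returns None unless at least one entity consists solely of
    uppercase tokens. The name and return shape are illustrative.
    """
    # True if any predicted entity is made up of uppercase tokens only
    has_uppercase_ent = any(
        all(token.is_upper for token in ent) for ent in doc.ents
    )
    if not has_uppercase_ent:
        return None
    # Keep every predicted entity so the example is fully annotated
    entities = [(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]
    return (doc.text, {"entities": entities})
```

With a loaded pipeline `nlp`, this would be called as `uppercase_examples(nlp(text))` for each text in your corpus, discarding the `None` results.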

See this section for more examples of how to create training data using spaCy. You can also use spaCy to generate the lowercase and titlecase variations of the selected entities to bootstrap your training data, which should hopefully save you a lot of time and work.
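To make the case-variation idea concrete, here is a minimal sketch in plain Python. The function name `case_variants` and the example sentence are hypothetical, and the output uses the `(text, {"entities": [(start, end, label)]})` tuple format. The end offset is recomputed for each variant rather than reused, because changing case can alter the string length for some non-ASCII characters.

```python
def case_variants(text, start, end, label):
    """Return training examples with the entity span recased.

    Produces the original, lowercase and titlecase versions of the
    entity at text[start:end], skipping duplicates.
    """
    entity = text[start:end]
    examples = []
    seen = set()
    for variant in (entity, entity.lower(), entity.title()):
        if variant in seen:
            continue  # e.g. the entity was already lowercase
        seen.add(variant)
        new_text = text[:start] + variant + text[end:]
        # Recompute the end offset from the variant's actual length
        new_end = start + len(variant)
        examples.append((new_text, {"entities": [(start, new_end, label)]}))
    return examples


examples = case_variants("I moved to NEW JERSEY last year.", 11, 21, "LOC")
for example_text, annotations in examples:
    print(example_text, annotations)
```

Note that `str.title()` capitalises every word, so a name like BANK OF AMERICA would become "Bank Of America"; variants like that are worth checking by hand before they go into the training set.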
