Joe

Reputation: 1768

spaCy, NER, documentation about the different label types of a particular LM

I am using spaCy for named entity recognition (NER). According to the spaCy docs, the language model en_core_web_sm is able to recognize 18 different entity types, i.e., it provides 18 labels such as DATE, PERSON or ORG.

I am particularly interested in the labels LOC (location), FAC (facilities) and GPE (geopolitical entities). Is there documentation about which objects are typically labelled with those labels? Have the guidelines used for labelling the entities been published?

I am asking because sometimes it is not clear to me why a particular object is labelled as, say, FAC and not as GPE, or why an object is not labelled at all. Let's have a look at an example:

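import spacy
from spacy import displacy

# spacy.cli.download('en_core_web_sm')  # uncomment to fetch the model once
nlp = spacy.load('en_core_web_sm')
text = 'Alice Miller went to the Empire State Building. Next she went to Times Square. Finally she went to the train station.'
doc = nlp(text)  # run the loaded pipeline on the text
displacy.render(doc, style='ent')  # highlight the recognized entities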

The output is:

[Image: displaCy rendering of the entities recognized in the text]

To my mind, the Empire State Building is correctly labelled as GPE. Times Square, however, is labelled as FAC; I expected GPE. And "train station" is not recognized at all; I expected FAC.

Upvotes: 3

Views: 3343

Answers (1)

polm23

Reputation: 15593

If you check the page for a pipeline, you'll see the data sources listed. For the NER data in the English pipelines, OntoNotes is used. The schema is documented in the OntoNotes Manual, for example:

PERSON        People, including fictional
NORP          Nationalities or religious or political groups
FACILITY      Buildings, airports, highways, bridges, etc.
ORGANIZATION  Companies, agencies, institutions, etc.
GPE           Countries, cities, states
LOCATION      Non-GPE locations, mountain ranges, bodies of water
PRODUCT       Vehicles, weapons, foods, etc. (Not services)
EVENT         Named hurricanes, battles, wars, sports events, etc.
WORK OF ART   Titles of books, songs, etc.
LAW           Named documents made into laws
LANGUAGE      Any named language

In spaCy you can get these definitions using spacy.explain, like spacy.explain("FACILITY"). Sometimes the official documentation has more detailed explanations, though in this case it seems not to.
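As a minimal sketch of the spacy.explain lookup for the three labels from the question: it assumes spaCy is installed, and it uses the short label names (LOC, FAC, GPE) that appear on the entities rather than the long names from the manual. The variable names labels and descriptions are just illustrative. No model download should be needed, since this only reads spaCy's built-in glossary.

```python
import spacy

# Labels from the question, in the short form used on spaCy entities.
labels = ["LOC", "FAC", "GPE"]

# spacy.explain returns a short glossary description for a known label.
descriptions = {label: spacy.explain(label) for label in labels}

for label, description in descriptions.items():
    print(f"{label}: {description}")
```

The printed descriptions should correspond to the LOCATION, FACILITY and GPE rows of the schema quoted above.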

"train station" is not picked up because it is not a named entity - named entities are typically proper nouns, not common nouns.

Also note the model is not perfect and it will make mistakes, and it is hard to explain individual mistakes (see here).

Upvotes: 3
