Mittenchops

Reputation: 19664

Named Entity Recognition with Huggingface transformers, mapping back to complete entities

I'm looking at the documentation for Huggingface pipeline for Named Entity Recognition, and it's not clear to me how these results are meant to be used in an actual entity recognition model.

For instance, given the example in documentation:

>>> from transformers import pipeline

>>> nlp = pipeline("ner")

>>> sequence = ("Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore very "
...             "close to the Manhattan Bridge which is visible from the window.")

This outputs a list of all words that have been identified as an entity from the 9 classes defined above. Here are the expected results:

print(nlp(sequence))

[
{'word': 'Hu', 'score': 0.9995632767677307, 'entity': 'I-ORG'},
{'word': '##gging', 'score': 0.9915938973426819, 'entity': 'I-ORG'},
{'word': 'Face', 'score': 0.9982671737670898, 'entity': 'I-ORG'},
{'word': 'Inc', 'score': 0.9994403719902039, 'entity': 'I-ORG'},
{'word': 'New', 'score': 0.9994346499443054, 'entity': 'I-LOC'},
{'word': 'York', 'score': 0.9993270635604858, 'entity': 'I-LOC'},
{'word': 'City', 'score': 0.9993864893913269, 'entity': 'I-LOC'},
{'word': 'D', 'score': 0.9825621843338013, 'entity': 'I-LOC'},
{'word': '##UM', 'score': 0.936983048915863, 'entity': 'I-LOC'},
{'word': '##BO', 'score': 0.8987102508544922, 'entity': 'I-LOC'},
{'word': 'Manhattan', 'score': 0.9758241176605225, 'entity': 'I-LOC'},
{'word': 'Bridge', 'score': 0.990249514579773, 'entity': 'I-LOC'}
]

While this alone is impressive, it isn't clear to me what the correct way is to get "DUMBO" back from:

{'word': 'D', 'score': 0.9825621843338013, 'entity': 'I-LOC'},
{'word': '##UM', 'score': 0.936983048915863, 'entity': 'I-LOC'},
{'word': '##BO', 'score': 0.8987102508544922, 'entity': 'I-LOC'},

or even how to handle the cleaner multi-token matches, like distinguishing "New York City" from simply the city of "York."

While I can imagine heuristic methods, what's the correct, intended way to join these tokens back into complete entity labels for the original input?
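For reference, the kind of heuristic I can imagine is sketched below: glue "##" continuation pieces onto the previous token and concatenate consecutive tokens that share a tag. (group_tokens is my own name, and because the output above carries no position information, this sketch also wrongly glues "New York City", "DUMBO" and "Manhattan Bridge" into one entity, which is exactly why I'm asking for the intended approach.)

def group_tokens(predictions):
    # Naive heuristic: merge '##' subword pieces into the previous token,
    # then concatenate consecutive tokens that carry the same entity tag.
    entities = []
    for pred in predictions:
        word, tag = pred["word"], pred["entity"]
        if word.startswith("##") and entities:
            entities[-1]["word"] += word[2:]
        elif entities and entities[-1]["entity"] == tag:
            entities[-1]["word"] += " " + word
        else:
            entities.append({"word": word, "entity": tag})
    return entities

print(group_tokens(nlp(sequence)))
# [{'word': 'Hugging Face Inc', 'entity': 'I-ORG'},
#  {'word': 'New York City DUMBO Manhattan Bridge', 'entity': 'I-LOC'}]  <-- wrongly merged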

Upvotes: 12

Views: 6437

Answers (2)

cronoik

Reputation: 19310

The pipeline object can group the tokens for you when you set the corresponding parameter:

from transformers import pipeline

# For transformers < 4.7.0:
# ner = pipeline("ner", grouped_entities=True)

# For transformers >= 4.7.0:
ner = pipeline("ner", aggregation_strategy='simple')

sequence = "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore very close to the Manhattan Bridge which is visible from the window."

output = ner(sequence)

print(output)

Output:

[{'entity_group': 'I-ORG', 'score': 0.9970663785934448, 'word': 'Hugging Face Inc'},
 {'entity_group': 'I-LOC', 'score': 0.9993778467178345, 'word': 'New York City'},
 {'entity_group': 'I-LOC', 'score': 0.9571147759755453, 'word': 'DUMBO'},
 {'entity_group': 'I-LOC', 'score': 0.9838141202926636, 'word': 'Manhattan Bridge'}]
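If you only need the surface strings, you can pull them out of the grouped output like this (just an illustration, not a pipeline feature):

entity_words = [entry["word"] for entry in output]
print(entity_words)
# ['Hugging Face Inc', 'New York City', 'DUMBO', 'Manhattan Bridge']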

Upvotes: 16

SilentCloud

Reputation: 1975

Quick update: grouped_entities has been deprecated.

UserWarning: grouped_entities is deprecated and will be removed in version v5.0.0, defaulted to aggregation_strategy="AggregationStrategy.SIMPLE" instead.
f'grouped_entities is deprecated and will be removed in version v5.0.0, defaulted to aggregation_strategy="{aggregation_strategy}" instead.'

You will have to change your code to:

ner = pipeline("ner", aggregation_strategy="simple")

Upvotes: 3
