Reputation: 19664
I'm looking at the Hugging Face pipeline documentation for Named Entity Recognition, and it's not clear to me how these results are meant to be used in an actual entity recognition model.
For instance, given the example in the documentation:
>>> from transformers import pipeline
>>> nlp = pipeline("ner")
>>> sequence = ("Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore very "
...             "close to the Manhattan Bridge which is visible from the window.")
This outputs a list of all words that have been identified as an entity from the 9 classes defined above. Here are the expected results:
print(nlp(sequence))
[
{'word': 'Hu', 'score': 0.9995632767677307, 'entity': 'I-ORG'},
{'word': '##gging', 'score': 0.9915938973426819, 'entity': 'I-ORG'},
{'word': 'Face', 'score': 0.9982671737670898, 'entity': 'I-ORG'},
{'word': 'Inc', 'score': 0.9994403719902039, 'entity': 'I-ORG'},
{'word': 'New', 'score': 0.9994346499443054, 'entity': 'I-LOC'},
{'word': 'York', 'score': 0.9993270635604858, 'entity': 'I-LOC'},
{'word': 'City', 'score': 0.9993864893913269, 'entity': 'I-LOC'},
{'word': 'D', 'score': 0.9825621843338013, 'entity': 'I-LOC'},
{'word': '##UM', 'score': 0.936983048915863, 'entity': 'I-LOC'},
{'word': '##BO', 'score': 0.8987102508544922, 'entity': 'I-LOC'},
{'word': 'Manhattan', 'score': 0.9758241176605225, 'entity': 'I-LOC'},
{'word': 'Bridge', 'score': 0.990249514579773, 'entity': 'I-LOC'}
]
While this alone is impressive, it isn't clear to me what the correct way is to get "DUMBO" from:
{'word': 'D', 'score': 0.9825621843338013, 'entity': 'I-LOC'},
{'word': '##UM', 'score': 0.936983048915863, 'entity': 'I-LOC'},
{'word': '##BO', 'score': 0.8987102508544922, 'entity': 'I-LOC'},
or even how to handle the cleaner multi-token matches, such as distinguishing "New York City" from simply the city of "York."
While I can imagine heuristic methods, what's the correct intended way to join these tokens back into correct labels given your inputs?
Upvotes: 12
Views: 6437
Reputation: 19310
The pipeline object can do that for you when you set the parameter:
transformers < 4.7.0: grouped_entities to True.
transformers >= 4.7.0: aggregation_strategy to simple.
from transformers import pipeline
#transformers < 4.7.0
#ner = pipeline("ner", grouped_entities=True)
ner = pipeline("ner", aggregation_strategy='simple')
sequence = "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore very close to the Manhattan Bridge which is visible from the window."
output = ner(sequence)
print(output)
Output:
[{'entity_group': 'I-ORG', 'score': 0.9970663785934448, 'word': 'Hugging Face Inc'},
 {'entity_group': 'I-LOC', 'score': 0.9993778467178345, 'word': 'New York City'},
 {'entity_group': 'I-LOC', 'score': 0.9571147759755453, 'word': 'DUMBO'},
 {'entity_group': 'I-LOC', 'score': 0.9838141202926636, 'word': 'Manhattan Bridge'}]
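For intuition about what this grouping does, here is a rough pure-Python sketch. It is not the library's implementation. It assumes each token dict also carries an 'index' field giving its position in the tokenized input (newer pipeline versions report one, but the index values below are filled in by hand for illustration), glues '##' WordPiece continuations onto the previous token, joins neighbouring tokens that share a label, and averages their scores. The function name group_wordpieces is made up for this example, and the B-/I- prefix handling that the real pipeline performs is ignored.

```python
from statistics import mean


def group_wordpieces(tokens):
    """Merge position-adjacent tokens that share a label into one entity."""
    groups = []
    for tok in tokens:
        last = groups[-1] if groups else None
        adjacent = last is not None and tok["index"] == last["last_index"] + 1
        if adjacent and tok["entity"] == last["entity_group"]:
            if tok["word"].startswith("##"):
                # WordPiece continuation: append without a space.
                last["word"] += tok["word"][2:]
            else:
                # New word belonging to the same entity.
                last["word"] += " " + tok["word"]
            last["scores"].append(tok["score"])
            last["last_index"] = tok["index"]
        else:
            groups.append({
                "entity_group": tok["entity"],
                "word": tok["word"],
                "scores": [tok["score"]],
                "last_index": tok["index"],
            })
    return [
        {"entity_group": g["entity_group"], "score": mean(g["scores"]), "word": g["word"]}
        for g in groups
    ]


# Words, scores and labels are copied from the question's output.
# The leading index of each row is an assumed token position (illustrative).
raw = [
    (1, "Hu", 0.9995632767677307, "I-ORG"),
    (2, "##gging", 0.9915938973426819, "I-ORG"),
    (3, "Face", 0.9982671737670898, "I-ORG"),
    (4, "Inc", 0.9994403719902039, "I-ORG"),
    (11, "New", 0.9994346499443054, "I-LOC"),
    (12, "York", 0.9993270635604858, "I-LOC"),
    (13, "City", 0.9993864893913269, "I-LOC"),
    (19, "D", 0.9825621843338013, "I-LOC"),
    (20, "##UM", 0.936983048915863, "I-LOC"),
    (21, "##BO", 0.8987102508544922, "I-LOC"),
    (28, "Manhattan", 0.9758241176605225, "I-LOC"),
    (29, "Bridge", 0.990249514579773, "I-LOC"),
]
tokens = [dict(index=i, word=w, score=s, entity=e) for i, w, s, e in raw]

for group in group_wordpieces(tokens):
    print(group["entity_group"], group["word"])
# I-ORG Hugging Face Inc
# I-LOC New York City
# I-LOC DUMBO
# I-LOC Manhattan Bridge
```

The position check is what keeps "New York City" and "DUMBO" apart even though both are I-LOC. Without an index (or character offsets) the raw list alone cannot tell you whether two entries were neighbours in the text, which is one reason to let the pipeline do the grouping.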
Upvotes: 16
Reputation: 1975
Quick update: grouped_entities has been deprecated.
UserWarning: grouped_entities is deprecated and will be removed in version v5.0.0, defaulted to aggregation_strategy="AggregationStrategy.SIMPLE" instead.
f'grouped_entities is deprecated and will be removed in version v5.0.0, defaulted to aggregation_strategy="{aggregation_strategy}" instead.'
You will have to change your code to:
ner = pipeline("ner", aggregation_strategy="simple")
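If the same code has to run against both older and newer installs, one option is to pick the keyword from the installed version. The sketch below is only an illustration: the helper name ner_grouping_kwargs is made up, and the 4.7.0 cutoff is taken from the comment in the answer above rather than checked against the release notes.

```python
import re


def ner_grouping_kwargs(version):
    """Return the pipeline keyword that enables entity grouping for a version string."""
    # Compare major and minor only, e.g. "4.10.2" -> (4, 10).
    major_minor = tuple(int(n) for n in re.findall(r"\d+", version)[:2])
    if major_minor >= (4, 7):
        return {"aggregation_strategy": "simple"}
    # Older releases: the flag that was later deprecated.
    return {"grouped_entities": True}


print(ner_grouping_kwargs("4.6.1"))
print(ner_grouping_kwargs("4.7.0"))

# Intended use (not executed here, since it needs transformers and a model download):
# import transformers
# ner = transformers.pipeline("ner", **ner_grouping_kwargs(transformers.__version__))
```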
Upvotes: 3