Michael Davidson

Reputation: 1411

Transformer Pipeline for NER returns partial words with ##s

How should I interpret the partial words with '##'s in them returned by the Transformers NER pipeline? Other tools like Flair and spaCy return whole words together with their tags. I have worked with the CoNLL dataset before and never noticed anything like this. Why are words being split up like this?

Example from the HuggingFace documentation:

from transformers import pipeline

nlp = pipeline("ner")

sequence = "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore very" \
           "close to the Manhattan Bridge which is visible from the window."

print(nlp(sequence))

Output:

[
    {'word': 'Hu', 'score': 0.9995632767677307, 'entity': 'I-ORG'},
    {'word': '##gging', 'score': 0.9915938973426819, 'entity': 'I-ORG'},
    {'word': 'Face', 'score': 0.9982671737670898, 'entity': 'I-ORG'},
    {'word': 'Inc', 'score': 0.9994403719902039, 'entity': 'I-ORG'},
    {'word': 'New', 'score': 0.9994346499443054, 'entity': 'I-LOC'},
    {'word': 'York', 'score': 0.9993270635604858, 'entity': 'I-LOC'},
    {'word': 'City', 'score': 0.9993864893913269, 'entity': 'I-LOC'},
    {'word': 'D', 'score': 0.9825621843338013, 'entity': 'I-LOC'},
    {'word': '##UM', 'score': 0.936983048915863, 'entity': 'I-LOC'},
    {'word': '##BO', 'score': 0.8987102508544922, 'entity': 'I-LOC'},
    {'word': 'Manhattan', 'score': 0.9758241176605225, 'entity': 'I-LOC'},
    {'word': 'Bridge', 'score': 0.990249514579773, 'entity': 'I-LOC'}
]

Upvotes: 1

Views: 1921

Answers (2)

Amir

Reputation: 16587

Use the aggregation_strategy argument to group the entities:

pipeline('ner', model="YOUR_MODEL", aggregation_strategy="average")

Read more about strategies here.
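
As a minimal sketch of what that looks like end-to-end (the checkpoint "dslim/bert-base-NER" is only an example here; substitute your own model):

from transformers import pipeline

# Example checkpoint; replace with your own fine-tuned NER model.
ner = pipeline("ner",
               model="dslim/bert-base-NER",
               aggregation_strategy="average")

sequence = "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore very " \
           "close to the Manhattan Bridge which is visible from the window."

# With an aggregation strategy set, sub-word pieces such as 'Hu' and '##gging'
# are merged back into whole entity spans, e.g.
# {'entity_group': 'ORG', 'word': 'Hugging Face Inc', 'score': ...}
print(ner(sequence))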

Upvotes: 1

mrbTT

Reputation: 1409

PyTorch transformers and BERT work with two tokenizations: one of plain words, and one of words plus sub-words, which splits a word into its base piece and its continuation pieces, adding "##" to the start of each continuation.

Let's say you have the phrase: I like hugging animals

The first set of tokens would be:

["I", "like", "hugging", "animals"]

And the second list with the sub-words would be:

["I", "like", "hug", "##gging", "animal", "##s"]

You can learn more here: https://www.kaggle.com/funtowiczmo/hugging-face-tutorials-training-tokenizer
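
If you want to see that split directly, here is a small sketch with an example checkpoint ("bert-base-cased", chosen only for illustration; the exact pieces depend on the model's vocabulary):

from transformers import AutoTokenizer

# Example checkpoint; the split depends on the vocabulary it was trained with.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

print(tokenizer.tokenize("Hugging Face Inc. is based in DUMBO"))
# Continuation pieces carry the '##' prefix, e.g.
# ['Hu', '##gging', 'Face', 'Inc', '.', 'is', 'based', 'in', 'D', '##UM', '##BO']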

Upvotes: 3
