Reputation: 103
I have a pandas column containing multiple-word entities. I want to turn the entities labeled as PERSON into a single word. My input looks like this:
Text
Vote for Donald Trump
Vote for Barack Obama
Vote for Bernie Sanders
Move to another location
Support LaVaughn Robinson
Support Michelle LaVaughn Robinson
Support Sanders
I need my output to look like this:
Text
Vote for Donald_Trump
Vote for Barack_Obama
Vote for Bernie_Sanders
Move to another location
Support LaVaughn_Robinson
Support Michelle_LaVaughn_Robinson
Support Sanders
My first thought was to use spaCy NER, return the PERSON entities and later combine the words returned, but I'm getting words that are not named entities. Can I do it using BILOU? Is there any other way to do it?
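For reference, this is roughly the first attempt described above (just a sketch; the small English model is an assumption on my side):
import spacy
import pandas as pd

nlp = spacy.load("en_core_web_sm")  # assumption: any English model with NER

def join_persons(text):
    # Replace the spaces inside each PERSON entity with underscores.
    doc = nlp(text)
    for ent in doc.ents:
        if ent.label_ == "PERSON":
            text = text.replace(ent.text, ent.text.replace(" ", "_"))
    return text

df["Text"] = df["Text"].apply(join_persons)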
Updated question
I have a new dataset that has 2 columns. spaCy NER seems to perform best, but it only identifies the persons in the complete_text column. When I run it on the incomplete_text column, it cannot identify the persons. Is there a way to map the persons identified in the complete_text column and match them with the persons in the incomplete_text column? I'm sorry if this sounds confusing. This is what my dataset looks like:
complete_text                        incomplete_text
everyone to vote for Marine Le Pen   vote for Marine Le Pen
When I use spaCy to get the person in both the complete_text column and the incomplete_text column, it only returns the person in complete_text, not in incomplete_text. I want to match the person identified in the complete_text column with the person in the incomplete_text column and return the identified person as a single word:
complete_text:                 everyone to vote for Marine Le Pen
spacy_complete_text_person:    [Marine Le Pen]
incomplete_text:               vote for Marine Le Pen
spacy_incomplete_text_person:  []
result:                        vote for Marine_Le_Pen
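To clarify what I mean by mapping, something like this sketch (the helper name and the model choice are made up by me):
import spacy
import pandas as pd

nlp = spacy.load("en_core_web_trf")

def carry_over_person(row):
    # Take the persons spaCy finds in complete_text and mark them in incomplete_text.
    doc = nlp(row["complete_text"])
    text = row["incomplete_text"]
    for ent in doc.ents:
        if ent.label_ == "PERSON" and ent.text in text:
            text = text.replace(ent.text, ent.text.replace(" ", "_"))
    return text

df["result"] = df.apply(carry_over_person, axis=1)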
Upvotes: 1
Views: 702
Reputation: 120409
You can use the BERT model dslim/bert-base-NER with transformers:
# Python env:   pip install transformers
# Anaconda env: conda install transformers
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

def process_person(txt):
    # Currently not optimized for Pandas (processes one row at a time)
    l = nlp(txt)
    for d in l:
        if d['entity_group'] == 'PER':
            s = d['start']
            e = d['end']  # 'end' is exclusive, so slice up to e
            txt = txt[:s] + txt[s:e].replace(' ', '_') + txt[e:]
    return txt

tokenizer = AutoTokenizer.from_pretrained("dslim/bert-large-NER")
model = AutoModelForTokenClassification.from_pretrained("dslim/bert-large-NER")
nlp = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy='max')

df['Text2'] = df['Text'].apply(process_person)
Output:
>>> df
Text Text2
0 Vote for Donald Trump Vote for Donald_Trump
1 Vote for Barack Obama Vote for Barack_Obama
2 Vote for Bernie Sanders Vote for Bernie_Sanders
3 Move to another location Move to another location
4 Support LaVaughn Robinson Support LaVaughn_Robinson
5 Support Michelle LaVaughn Robinson Support Michelle_LaVaughn_Robinson
6 Support Sanders Support Sanders
7 RT Please RT Please
8 STOP luc STOP luc
9 Kick Some Kick Some
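The function above runs the pipeline one row at a time; a possible batched variant (a sketch, not part of the original answer) passes the whole column to the pipeline in a single call:
# Sketch: batch all texts through the pipeline at once.
texts = df['Text'].tolist()
results = nlp(texts)  # one list of entity dicts per input text

def merge_persons(txt, entities):
    # Replace the spaces inside each PERSON span with underscores.
    for d in entities:
        if d['entity_group'] == 'PER':
            s, e = d['start'], d['end']
            txt = txt[:s] + txt[s:e].replace(' ', '_') + txt[e:]
    return txt

df['Text2'] = [merge_persons(t, ents) for t, ents in zip(texts, results)]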
Upvotes: 1
Reputation: 13242
python -m spacy download en_core_web_trf
Given:
Text
0 Vote for Donald Trump
1 Vote for Barack Obama
2 Vote for Bernie Sanders
3 Move to another location
4 Support LaVaughn Robinson
5 Support Michelle LaVaughn Robinson
6 Support Sanders
7 Vote for RT Please
8 STOP luc
9 Support Kick Some
import spacy
import pandas as pd
nlp = spacy.load("en_core_web_trf")
df.Text = df.Text.apply(nlp)
print(df.Text.apply(lambda x: x.ents))
print(df.Text.apply(lambda t: [x.label_ for x in t.ents]))
Output:
0 ((Donald, Trump),)
1 ((Barack, Obama),)
2 ((Bernie, Sanders),)
3 ()
4 ((LaVaughn, Robinson),)
5 ((Michelle, LaVaughn, Robinson),)
6 ((Sanders),)
7 ()
8 ()
9 ()
0 [PERSON]
1 [PERSON]
2 [PERSON]
3 []
4 [PERSON]
5 [PERSON]
6 [PERSON]
7 []
8 []
9 []
Name: Text, dtype: object
Seems to be finding things fairly well for me~
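To turn those PERSON spans into the underscore-joined output asked for, a possible follow-up (a sketch, not part of the original answer) rebuilds each text from the Doc objects now stored in the Text column:
def underscore_persons(doc):
    # Rebuild the text, joining each PERSON span with underscores.
    text = doc.text
    for ent in doc.ents:
        if ent.label_ == "PERSON":
            s, e = ent.start_char, ent.end_char
            text = text[:s] + text[s:e].replace(" ", "_") + text[e:]
    return text

df["Text2"] = df.Text.apply(underscore_persons)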
Upvotes: 1