CPDatascience

Reputation: 103

How to make multi-word entities a single word?

I have a pandas column containing multi-word entities. I want to join each entity labeled PERSON into a single word. My input looks like this:

    Text
 Vote for Donald Trump
 Vote for Barack Obama
 Vote for Bernie Sanders
 Move to another location
 Support LaVaughn Robinson
 Support Michelle LaVaughn Robinson
 Support Sanders

I need my output to look like this:

   Text
Vote for Donald_Trump
Vote for Barack_Obama
Vote for Bernie_Sanders
Move to another location
Support LaVaughn_Robinson
Support Michelle_LaVaughn_Robinson
Support Sanders

My first thought was to use spaCy NER, extract the PERSON entities, and then join the words it returns, but I'm getting words that are not named entities. Can I do it using BILOU tags? Is there any other way to do it?

Updated question

I have a new dataset that has two columns. spaCy NER seems to perform best, but it only identifies the persons in the column complete_text. When I run it on the column incomplete_text, it cannot identify them. Is there a way to take the person identified in complete_text and match it with the person in incomplete_text? I'm sorry if this sounds confusing. This is what my dataset looks like:

complete_text                       incomplete_text
everyone to vote for Marine Le Pen  vote for Marine Le Pen

When I use spaCy to get the person from both the complete_text column and the incomplete_text column, it only returns the person for complete_text, not for incomplete_text. I want to match the person identified in the complete_text column with the person in the incomplete_text column and return that person as a single word.

complete_text                       spacy_complete_text_person  incomplete_text         spacy_incomplete_text__person  result
everyone to vote for Marine Le Pen  [Marine Le Pen]             vote for Marine Le Pen  []                             vote for Marine_Le_Pen
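
The mapping described above could be sketched roughly as follows. This is only an illustration, assuming the names found in complete_text are available as plain strings and appear verbatim in incomplete_text; the helper name `join_known_persons` is hypothetical and not part of spaCy or pandas.

```python
def join_known_persons(text, persons):
    """Replace each known person name in `text` with its underscore-joined form."""
    # Longest names first, so that "Michelle LaVaughn Robinson" is handled
    # before the shorter "LaVaughn Robinson" it contains.
    for name in sorted(persons, key=len, reverse=True):
        text = text.replace(name, name.replace(" ", "_"))
    return text


# Names detected in complete_text, applied to the matching incomplete_text.
print(join_known_persons("vote for Marine Le Pen", ["Marine Le Pen"]))
# prints: vote for Marine_Le_Pen
```

If the detected names are spaCy spans rather than strings, they would need converting first (for example with `str(span)` or `span.text`). Exact substring matching is a simplification: it will miss names that are spelled or cased differently in the two columns.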

Upvotes: 1

Views: 702

Answers (2)

Corralien

Reputation: 120409

You can use a BERT NER model such as dslim/bert-large-NER with transformers:

# Python env: pip install transformers
# Anaconda env: conda install transformers
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

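def process_person(txt):
    # Currently not optimized for Pandas
    # `nlp` is the NER pipeline created further down, before this function is called
    for d in nlp(txt):
        if d['entity_group'] == 'PER':
            s, e = d['start'], d['end']  # character offsets; `e` is exclusive
            # The replacement keeps the length unchanged, so the offsets
            # of any later entities stay valid
            txt = txt[:s] + txt[s:e].replace(' ', '_') + txt[e:]
    return txt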

tokenizer = AutoTokenizer.from_pretrained("dslim/bert-large-NER")
model = AutoModelForTokenClassification.from_pretrained("dslim/bert-large-NER")

nlp = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy='max')
df['Text2'] = df['Text'].apply(process_person)

Output:

>>> df
                                 Text                               Text2
0               Vote for Donald Trump               Vote for Donald_Trump
1               Vote for Barack Obama               Vote for Barack_Obama
2             Vote for Bernie Sanders             Vote for Bernie_Sanders
3            Move to another location            Move to another location
4           Support LaVaughn Robinson           Support LaVaughn_Robinson
5  Support Michelle LaVaughn Robinson  Support Michelle_LaVaughn_Robinson
6                     Support Sanders                     Support Sanders
7                           RT Please                           RT Please
8                            STOP luc                            STOP luc
9                           Kick Some                           Kick Some

Upvotes: 1

BeRT2me

Reputation: 13242

python -m spacy download en_core_web_trf

Given:

                                 Text
0               Vote for Donald Trump
1               Vote for Barack Obama
2             Vote for Bernie Sanders
3            Move to another location
4           Support LaVaughn Robinson
5  Support Michelle LaVaughn Robinson
6                     Support Sanders
7                  Vote for RT Please
8                            STOP luc
9                   Support Kick Some

import spacy
import pandas as pd

nlp = spacy.load("en_core_web_trf")

df.Text = df.Text.apply(nlp)

print(df.Text.apply(lambda x: x.ents))
print(df.Text.apply(lambda t: [x.label_ for x in t.ents]))

Output:

0                   ((Donald, Trump),)
1                   ((Barack, Obama),)
2                 ((Bernie, Sanders),)
3                                   ()
4              ((LaVaughn, Robinson),)
5    ((Michelle, LaVaughn, Robinson),)
6                         ((Sanders),)
7                                   ()
8                                   ()
9                                   ()

0    [PERSON]
1    [PERSON]
2    [PERSON]
3          []
4    [PERSON]
5    [PERSON]
6    [PERSON]
7          []
8          []
9          []
Name: Text, dtype: object

Seems to be finding things fairly well for me~
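
To get from the detected entities to the underscore-joined text the question asks for, something along these lines might work. It is a minimal sketch rather than a tested solution: `join_spans` and `join_persons` are hypothetical helper names, and the only spaCy attributes assumed are `doc.ents`, `ent.label_`, `ent.start_char` and `ent.end_char`.

```python
def join_spans(text, spans):
    """Replace spaces with underscores inside each (start, end) character span."""
    for start, end in spans:
        # The replacement keeps the string length, so later offsets stay valid.
        text = text[:start] + text[start:end].replace(" ", "_") + text[end:]
    return text


def join_persons(text, nlp):
    """Join every PERSON entity that the spaCy pipeline `nlp` finds in `text`."""
    doc = nlp(text)
    spans = [
        (ent.start_char, ent.end_char)
        for ent in doc.ents
        if ent.label_ == "PERSON"
    ]
    return join_spans(text, spans)
```

With the pipeline loaded above, this could be applied to the original string column (before it is replaced by Doc objects), e.g. `df["Text"].apply(join_persons, nlp=nlp)`.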

Upvotes: 1
