Reputation: 103
I have a pandas column containing multiple-word entities. I want to turn the entities labeled as PERSON into a single word. My input looks like this:
Text
Vote for Donald Trump
Vote for Barack Obama
Vote for Bernie Sanders
Move to another location
Support LaVaughn Robinson
Support Michelle LaVaughn Robinson
Support Sanders
I need my output to look like this:
Text
Vote for Donald_Trump
Vote for Barack_Obama
Vote for Bernie_Sanders
Move to another location
Support LaVaughn_Robinson
Support Michelle_LaVaughn_Robinson
Support Sanders
My first thought was to use spaCy NER, return the PERSON entities and later combine the words returned, but I'm getting words that are not named entities. Can I do it using BILOU? Is there any other way to do it?
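For reference, this is roughly the first attempt described above (just a sketch; the small English model is an assumption on my side):
import spacy
import pandas as pd

nlp = spacy.load("en_core_web_sm")  # assumption: any English model with NER

def join_persons(text):
    # Replace the spaces inside each PERSON entity with underscores.
    doc = nlp(text)
    for ent in doc.ents:
        if ent.label_ == "PERSON":
            text = text.replace(ent.text, ent.text.replace(" ", "_"))
    return text

df["Text"] = df["Text"].apply(join_persons)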
Updated question
I have a new dataset that has 2 columns. spaCy NER seems to perform best, but it only identifies the persons in the complete_text column. When I run it on the incomplete_text column, it cannot identify the persons. Is there a way to map the persons identified in the complete_text column and match them with the persons in the incomplete_text column? I'm sorry if this sounds confusing. This is what my dataset looks like:
complete_text                        incomplete_text
everyone to vote for Marine Le Pen   vote for Marine Le Pen
When I use spaCy to get the person in both the complete_text column and the incomplete_text column, it only returns the person in complete_text, not in incomplete_text. I want to match the person identified in the complete_text column with the person in the incomplete_text column and return the identified person as a single word:
complete_text:                 everyone to vote for Marine Le Pen
spacy_complete_text_person:    [Marine Le Pen]
incomplete_text:               vote for Marine Le Pen
spacy_incomplete_text_person:  []
result:                        vote for Marine_Le_Pen
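To clarify what I mean by mapping, something like this sketch (the helper name and the model choice are made up by me):
import spacy
import pandas as pd

nlp = spacy.load("en_core_web_trf")

def carry_over_person(row):
    # Take the persons spaCy finds in complete_text and mark them in incomplete_text.
    doc = nlp(row["complete_text"])
    text = row["incomplete_text"]
    for ent in doc.ents:
        if ent.label_ == "PERSON" and ent.text in text:
            text = text.replace(ent.text, ent.text.replace(" ", "_"))
    return text

df["result"] = df.apply(carry_over_person, axis=1)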
Upvotes: 1
Views: 702
Reputation: 120409
You can use the BERT model dslim/bert-base-NER with transformers:
# Python env:   pip install transformers
# Anaconda env: conda install transformers
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

def process_person(txt):
    # Currently not optimized for Pandas (processes one row at a time)
    l = nlp(txt)
    for d in l:
        if d['entity_group'] == 'PER':
            s = d['start']
            e = d['end']  # 'end' is exclusive, so slice up to e
            txt = txt[:s] + txt[s:e].replace(' ', '_') + txt[e:]
    return txt

tokenizer = AutoTokenizer.from_pretrained("dslim/bert-large-NER")
model = AutoModelForTokenClassification.from_pretrained("dslim/bert-large-NER")
nlp = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy='max')

df['Text2'] = df['Text'].apply(process_person)
Output:
>>> df
Text Text2
0 Vote for Donald Trump Vote for Donald_Trump
1 Vote for Barack Obama Vote for Barack_Obama
2 Vote for Bernie Sanders Vote for Bernie_Sanders
3 Move to another location Move to another location
4 Support LaVaughn Robinson Support LaVaughn_Robinson
5 Support Michelle LaVaughn Robinson Support Michelle_LaVaughn_Robinson
6 Support Sanders Support Sanders
7 RT Please RT Please
8 STOP luc STOP luc
9 Kick Some Kick Some
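The function above runs the pipeline one row at a time; a possible batched variant (a sketch, not part of the original answer) passes the whole column to the pipeline in a single call:
# Sketch: batch all texts through the pipeline at once.
texts = df['Text'].tolist()
results = nlp(texts)  # one list of entity dicts per input text

def merge_persons(txt, entities):
    # Replace the spaces inside each PERSON span with underscores.
    for d in entities:
        if d['entity_group'] == 'PER':
            s, e = d['start'], d['end']
            txt = txt[:s] + txt[s:e].replace(' ', '_') + txt[e:]
    return txt

df['Text2'] = [merge_persons(t, ents) for t, ents in zip(texts, results)]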
Upvotes: 1
Reputation: 13242
python -m spacy download en_core_web_trf
Given:
Text
0 Vote for Donald Trump
1 Vote for Barack Obama
2 Vote for Bernie Sanders
3 Move to another location
4 Support LaVaughn Robinson
5 Support Michelle LaVaughn Robinson
6 Support Sanders
7 Vote for RT Please
8 STOP luc
9 Support Kick Some
import spacy
import pandas as pd
nlp = spacy.load("en_core_web_trf")
df.Text = df.Text.apply(nlp)
print(df.Text.apply(lambda x: x.ents))
print(df.Text.apply(lambda t: [x.label_ for x in t.ents]))
Output:
0 ((Donald, Trump),)
1 ((Barack, Obama),)
2 ((Bernie, Sanders),)
3 ()
4 ((LaVaughn, Robinson),)
5 ((Michelle, LaVaughn, Robinson),)
6 ((Sanders),)
7 ()
8 ()
9 ()
0 [PERSON]
1 [PERSON]
2 [PERSON]
3 []
4 [PERSON]
5 [PERSON]
6 [PERSON]
7 []
8 []
9 []
Name: Text, dtype: object
Seems to be finding things fairly well for me~
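To turn those PERSON spans into the underscore-joined output asked for, a possible follow-up (a sketch, not part of the original answer) rebuilds each text from the Doc objects now stored in the Text column:
def underscore_persons(doc):
    # Rebuild the text, joining each PERSON span with underscores.
    text = doc.text
    for ent in doc.ents:
        if ent.label_ == "PERSON":
            s, e = ent.start_char, ent.end_char
            text = text[:s] + text[s:e].replace(" ", "_") + text[e:]
    return text

df["Text2"] = df.Text.apply(underscore_persons)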
Upvotes: 1