Reputation: 2158
I have a bunch of rows that contain text data in sentences. I am trying to apply entity extraction with spaCy to get organizations and locations.
I can pass in a single string and get the entities. However, when I apply that to a DataFrame, it fails with the error shown in the code. I am not sure whether I wrote the for loop incorrectly or am not calling (X.text, X.label_) correctly. Is there a way to apply spaCy to DataFrame rows?
Dataframe not working:
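import numpy as np
import pandas as pd
import spacy
from spacy import displacy
import en_core_web_sm
# load the small English model once
nlp = en_core_web_sm.load()
# nlp = spacy.load("en")  # redundant second load; the "en" shortcut only works in spaCy 2.x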
id1 = [1,2,3]
text = ['University of California has great research located in San Diego',np.NaN,'MIT is at Boston']
df = pd.DataFrame({'id':id1,'text':text})
df['text'] = df['text'].astype(str)
print(df)
'''
id text
0 1 University of California has great research located in San Diego
1 2 nan
2 3 MIT is at Boston
'''
# works: passing nlp function from spacy
df['text'] = df['text'].apply(lambda x: nlp(x)) # tokenized it
print(df['text'])
# fails
for row in df.iterrows():
    # getting: AttributeError: 'spacy.tokens.doc.Doc' object has no attribute 'label_'
    test = [(X.text, X.label_) for X in df['text']]
    print(test)
String working:
sentence = 'University of California has great research located in San Diego'
result = nlp(sentence)
print([(X.text, X.label_) for X in result.ents])
'''
[('University of California', 'ORG'), ('San Diego', 'GPE')]
'''
How do I get results like this?:
id text spacy_results
0 1 University of California has great research located in San Diego [('University of California', 'ORG'), ('San Diego', 'GPE')]
1 2 nan nan
2 3 MIT is at Boston [('MIT', 'ORG'), ('Boston', 'GPE')]
Upvotes: 1
Views: 2076
Reputation: 11
Here is the code:
import pandas as pd
import spacy
nlp = spacy.load("en_core_web_sm")
text = [[1, 'University of California has great research located in San Diego'],[2, 'MIT is at Boston']]
df = pd.DataFrame(text, columns = ['id', 'text'])
def spacy_entity(sentence):
    # run the pipeline on one string and collect [text, label] for each entity
    doc = nlp(sentence)
    return [[w.text, w.label_] for w in doc.ents]
df['new_text'] = df['text'].apply(spacy_entity)
print(df['new_text'])
0 [[University of California, ORG], [San Diego, ...
1 [[MIT, ORG], [Boston, GPE]]
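The question's frame also contains a missing row, which the example above leaves out. The following is a minimal sketch of one way to handle it, assuming the text column is not cast with astype(str), so that missing cells stay NaN instead of becoming the string 'nan'. The helper names extract_entities and add_entity_column are hypothetical, and nlp is whatever spaCy pipeline was loaded earlier, passed in as a parameter.

```python
import pandas as pd


def extract_entities(text, nlp):
    """Return [(entity_text, label), ...] for one cell, or NaN if the cell is missing."""
    if pd.isna(text):
        # keep missing rows missing instead of running the model on them
        return float("nan")
    doc = nlp(text)
    # entities live on doc.ents; each entity span has .text and .label_
    return [(ent.text, ent.label_) for ent in doc.ents]


def add_entity_column(df, nlp, text_col="text", out_col="spacy_results"):
    """Return a copy of df with one list of (text, label) tuples per row."""
    out = df.copy()
    out[out_col] = out[text_col].apply(lambda t: extract_entities(t, nlp))
    return out
```

With the question's frame this could be called as add_entity_column(df, nlp) after loading the model, skipping the df['text'].astype(str) step so the NaN row passes through untouched.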
Upvotes: 1
Reputation: 722
text = [[1, 'University of California has great research located in San Diego'],[2, 'MIT is at Boston']]
df = pd.DataFrame(text, columns = ['id', 'text'])
# each cell of new_text holds a list of spaCy Span objects
df['new_text'] = df['text'].apply(lambda x: list(nlp(x).ents))
print(df["new_text"])
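The ents stored this way are spaCy Span objects, so the label still has to be read from each span's label_ attribute. For larger frames it may also help to stream the texts through nlp.pipe rather than calling nlp once per row. The sketch below combines both ideas under stated assumptions: the function name entity_pairs_batched is hypothetical, nlp is an already loaded spaCy pipeline, and missing cells are real NaN/None values rather than the string 'nan'.

```python
import pandas as pd


def entity_pairs_batched(df, nlp, text_col="text", out_col="new_text"):
    """Return a copy of df with (text, label) tuples per row, computed via nlp.pipe."""
    out = df.copy()
    # only send non-missing texts to the pipeline
    mask = out[text_col].notna()
    docs = nlp.pipe(out.loc[mask, text_col].tolist())
    # nlp.pipe yields docs in input order, so the results line up with the masked rows
    pairs = [[(ent.text, ent.label_) for ent in doc.ents] for doc in docs]
    # align on the original index; rows that were skipped end up as NaN
    out[out_col] = pd.Series(pairs, index=out.index[mask], dtype=object)
    return out
```

Called as entity_pairs_batched(df, nlp), this leaves the original frame unchanged and returns plain tuples, which print and serialize more predictably than Span objects.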
Upvotes: 0