Reputation: 83
I have tried to remove words from a document that spacy considers named entities, so basically removing "Sweden" and "Nokia" from the example string. I could not find a way around the problem that entities are stored as spans, so comparing them with single tokens from a spacy doc raises an error.
In a later step, this process is supposed to be a function applied to several text documents stored in a pandas data frame.
I would appreciate any kind of help, and also advice on how to better post questions, as this is my first one here.
import spacy

nlp = spacy.load('en')
text_data = u'This is a text document that speaks about entities like Sweden and Nokia'
document = nlp(text_data)

def remove_entities(document):
    text_no_namedentities = []
    for word in document:
        if word not in document.ents:
            text_no_namedentities.append(word)
    return " ".join(text_no_namedentities)
It creates the following error:
TypeError: Argument 'other' has incorrect type (expected spacy.tokens.token.Token, got spacy.tokens.span.Span)
Upvotes: 8
Views: 9959
Reputation: 405
I had an issue with the solutions above: kochar96's and APhillips's solutions modify the text because of spacy's tokenization, so can't becomes ca n't after the join.
I couldn't quite follow Batmobil's solution, but followed the general idea of using the start and end indices.
The hack-y numpy solution is explained in the printout. (I don't have time to do something more reasonable, feel free to edit and improve.)
import numpy as np
import spacy

nlp = spacy.load('en_core_web_sm')

text_data = "This can't be a text document that speaks about entities like Sweden and Nokia"
my_ents = [(e.start_char, e.end_char) for e in nlp(text_data).ents]
my_str = text_data
print(f'{my_ents=}')

# Pair up the indices so each row is a [start, end] slice of text to keep:
# [0, first_ent_start], [first_ent_end, second_ent_start], ..., [last_ent_end, -1].
# Note the trailing -1 drops the final character unless an entity ends the string.
idx_keep = [0] + np.array(my_ents).ravel().tolist() + [-1]
idx_keep = np.array(idx_keep).reshape(-1, 2)
print(idx_keep)

keep_text = ''
for start_char, end_char in idx_keep:
    keep_text += my_str[start_char:end_char]
print(keep_text)
my_ents=[(62, 68), (73, 78)]
[[ 0 62]
[68 73]
[78 -1]]
This can't be a text document that speaks about entities like and
Upvotes: 0
Reputation: 11
You could use the entity attributes start_char and end_char to replace each entity with an empty string.
import spacy

nlp = spacy.load('en_core_web_sm')
text_data = 'New York is in USA'
document = nlp(text_data)

ents = [(e.start_char, e.end_char) for e in document.ents]
# Delete from the end of the string first, so the start/end offsets of the
# remaining entities are still valid after each removal.
for start_char, end_char in reversed(ents):
    text_data = text_data[:start_char] + text_data[end_char:]
print(text_data)
Upvotes: 1
Reputation: 49
This will not handle entities covering multiple tokens.
import spacy

nlp = spacy.load('en_core_web_sm')
text_data = 'New York is in USA'
document = nlp(text_data)

text_no_namedentities = []
ents = [e.text for e in document.ents]
for item in document:
    if item.text not in ents:
        text_no_namedentities.append(item.text)
print(" ".join(text_no_namedentities))
Output
'New York is in'
Here USA is correctly removed, but New York is not, because the entity spans two tokens and the comparison is made per token.
Solution
import spacy

nlp = spacy.load('en_core_web_sm')
text_data = 'New York is in USA'
document = nlp(text_data)

# token.ent_type_ is set for every token inside an entity span, so
# multi-token entities such as "New York" are removed as well.
print(" ".join(token.text for token in document if not token.ent_type_))
Output
'is in'
Upvotes: 3
Reputation: 1181
This will get you the result you're asking for. Reviewing the documentation on Named Entity Recognition should help you going forward.
import spacy

nlp = spacy.load('en_core_web_sm')
text_data = 'This is a text document that speaks about entities like Sweden and Nokia'
document = nlp(text_data)

text_no_namedentities = []
ents = [e.text for e in document.ents]
for item in document:
    if item.text not in ents:
        text_no_namedentities.append(item.text)
print(" ".join(text_no_namedentities))
Output:
This is a text document that speaks about entities like and
Upvotes: 2