john_28

Reputation: 83

Removing named entities from a document using spacy

I am trying to remove words from a document that spaCy considers named entities, so basically removing "Sweden" and "Nokia" from the example string. I could not find a way to work around the problem that entities are stored as spans, so comparing them with single tokens from a spaCy doc raises an error.

In a later step, this process is supposed to become a function applied to several text documents stored in a pandas DataFrame.

I would appreciate any help, as well as advice on how to better post questions, since this is my first one here.


nlp = spacy.load('en')

text_data = u'This is a text document that speaks about entities like Sweden and Nokia'

document = nlp(text_data)

text_no_namedentities = []

for word in document:
    if word not in document.ents:
        text_no_namedentities.append(word)

return " ".join(text_no_namedentities)

It creates the following error:

TypeError: Argument 'other' has incorrect type (expected spacy.tokens.token.Token, got spacy.tokens.span.Span)

Upvotes: 8

Views: 9959

Answers (4)

orangenarwhals
orangenarwhals

Reputation: 405

I had an issue with the solutions above: kochar96's and APhillips's solutions modify the text because of spaCy's tokenization, so "can't" becomes "ca n't" after the join.

I couldn't quite follow Batmobil's solution, but I followed the general idea of using the start and end indices.

The hack-y numpy solution is explained by the printout below. (I don't have time to do something more reasonable; feel free to edit and improve it.)

text_data = "This can't be a text document that speaks about entities like Sweden and Nokia"
my_ents = [(e.start_char,e.end_char) for e in nlp(text_data).ents]
my_str = text_data

print(f'{my_ents=}')
idx_keep = [0] + np.array(my_ents).ravel().tolist() + [-1]
idx_keep = np.array(idx_keep).reshape(-1,2)
print(idx_keep)

keep_text = ''
for start_char, end_char in idx_keep:
    keep_text += my_str[start_char:end_char]
print(keep_text)
my_ents=[(62, 68), (73, 78)]
[[ 0 62]
 [68 73]
 [78 -1]]
This can't be a text document that speaks about entities like  and 
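
As an aside, the join artefact mentioned above ("can't" becoming "ca n't") can also be avoided without character offsets: each token's text_with_ws attribute carries its original trailing whitespace, so a minimal sketch along these lines (model name assumed) keeps the spacing intact:

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("This can't be a text document that speaks about entities like Sweden and Nokia")

# Keep tokens outside any entity; text_with_ws preserves each token's original
# trailing whitespace, so "ca" + "n't " joins back into "can't "
no_ents = ''.join(tok.text_with_ws for tok in doc if not tok.ent_type_)
print(no_ents)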

Upvotes: 0

Batmobil

Reputation: 11

You could use the entity attributes start_char and end_char to replace each entity with an empty string.

import spacy

nlp = spacy.load('en_core_web_sm')
text_data = 'New York is in USA'
document = nlp(text_data)

ents = [(e.start_char, e.end_char) for e in document.ents]

# Remove the entities back to front so the earlier character offsets stay valid
for start_char, end_char in reversed(ents):
    text_data = text_data[:start_char] + text_data[end_char:]
print(text_data)
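
Removing the spans back to front keeps the earlier character offsets valid; with the example above this prints " is in " (the spaces that surrounded the removed entities are left behind).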

Upvotes: 1

kochar96

Reputation: 49

This will not handle entities that span multiple tokens.

import spacy
nlp = spacy.load('en_core_web_sm')
text_data = 'New York is in USA'
document = nlp(text_data)

text_no_namedentities = []
ents = [e.text for e in document.ents]
for item in document:
    if item.text not in ents:
        text_no_namedentities.append(item.text)
print(" ".join(text_no_namedentities))

Output

'New York is in'

Here USA is correctly removed, but New York is not eliminated because it spans two tokens.

Solution

import spacy
nlp = spacy.load('en_core_web_sm')
text_data = 'New York is in USA'
document = nlp(text_data)
print(" ".join([ent.text for ent in document if not ent.ent_type_]))

Output

'is in'
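
This works because ent_type_ is set on every token that is part of an entity, so multi-token entities such as New York are dropped token by token. Since the question mentions applying this to several documents in a pandas DataFrame, a minimal sketch of that later step could look like the following (the DataFrame and its text column are just placeholders):

import spacy
import pandas as pd

nlp = spacy.load('en_core_web_sm')

def remove_named_entities(text):
    # Keep only tokens that are not part of a named entity
    doc = nlp(text)
    return " ".join(token.text for token in doc if not token.ent_type_)

# Hypothetical DataFrame with a 'text' column
df = pd.DataFrame({'text': ['New York is in USA',
                            'This sentence mentions no entities']})
df['text_no_entities'] = df['text'].apply(remove_named_entities)
print(df)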

Upvotes: 3

APhillips

Reputation: 1181

This will get you the result you're asking for. Reviewing spaCy's Named Entity Recognition documentation should help you going forward.
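
For reference, the original TypeError comes from word not in document.ents: document.ents is a tuple of Span objects, and the membership test ends up comparing a single Token against a Span, which spaCy's Token comparison does not accept. Comparing token text against the entity texts, as below, or checking token.ent_type_, as in kochar96's answer, avoids the Token/Span comparison.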

import spacy

nlp = spacy.load('en_core_web_sm')

text_data = 'This is a text document that speaks about entities like Sweden and Nokia'

document = nlp(text_data)

text_no_namedentities = []

ents = [e.text for e in document.ents]
for item in document:
    if item.text not in ents:
        text_no_namedentities.append(item.text)
print(" ".join(text_no_namedentities))

Output:

This is a text document that speaks about entities like and
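
Note that, like the first snippet in kochar96's answer, this text-based membership check only catches single-token entities; a multi-token entity such as "New York" would not be removed at all, whereas the ent_type_ check handles it.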

Upvotes: 2
