demongolem

Reputation: 9708

In spaCy, why is '\n' constantly tagged as GPE by the English NER?

I am getting acquainted with spaCy v2.0. When I run Lightning_Tour.py on my own documents, the end-of-line string \n is consistently tagged as GPE in the entity output.

Is there any way to preprocess the document to discourage this tagging? Or is this simply the behavior of the default English model?

Upvotes: 3

Views: 396

Answers (2)

Chandan Gupta

Reputation: 722

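from bs4 import BeautifulSoup
import spacy

# Tags whose text should be extracted from the XML document.
CONTENT_XML_TAG = ('p', 'ul', 'h3', 'h1', 'h2', 'ol')

# mytxt is assumed to hold the raw XML source as a string.
soup = BeautifulSoup(mytxt, 'xml')
spacy_model = spacy.load('en_core_web_sm')

# Join the text of each content element, one element per line.
content = "\n".join(p.get_text() for p in soup.find('body.content').find_all(CONTENT_XML_TAG))

def remove_whitespace_entities(doc):
    # Drop entities that consist only of whitespace, such as '\n'.
    doc.ents = [e for e in doc.ents if not e.text.isspace()]
    return doc

# Register the filter right after the NER component (spaCy v2 API),
# before processing any text, so every doc passes through it.
spacy_model.add_pipe(remove_whitespace_entities, after='ner')
doc = spacy_model(content)
print(doc.ents)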

Upvotes: 0

demongolem

Reputation: 9708

Yes, this is currently the behavior of the default model (I am using spaCy 2.0.5), and others have seen it too (see my comment above). For the time being, the workaround is to post-process the generated entities.
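To illustrate what such post-processing might look like, here is a minimal sketch that does not depend on spaCy itself. It assumes the entities have already been pulled out as (text, label) pairs, for example with [(ent.text, ent.label_) for ent in doc.ents]. The helper name drop_whitespace_entities and the sample data are hypothetical, made up for this illustration.

```python
from typing import Iterable, List, Tuple

# An entity is represented here as (entity text, label), e.g. ("\n", "GPE").
Entity = Tuple[str, str]


def drop_whitespace_entities(entities: Iterable[Entity]) -> List[Entity]:
    """Keep only entities whose text has at least one non-whitespace character."""
    return [(text, label) for text, label in entities if text.strip()]


# Hypothetical sample, shaped like [(ent.text, ent.label_) for ent in doc.ents].
raw_entities = [
    ("London", "GPE"),
    ("\n", "GPE"),
    ("Tuesday", "DATE"),
    ("\n\n", "GPE"),
]

cleaned = drop_whitespace_entities(raw_entities)
print(cleaned)
```

Filtering on text.strip() rather than comparing against "\n" alone should also catch runs of several newlines or other whitespace-only spans, if the model produces those.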

Upvotes: 1
