SpaCy tags new line (\n) as GPE named entities

Question

I am using SpaCy to extract named entities. However, it consistently mis-tags newline characters as named entities.

Below is the input text.

mytxt = """




KNOW YOUR ROLE ON SUPER BOWL LIII.







KNOW YOUR ROLE ON SUPER BOWL LIII.


Gale Group




Montpelier: Department of Motor Vehicles, has issued the following
news release:

Be a designated sober driver, help save lives. Remember these tips
on game night:

Know your State's laws: refusing to take a breath test in many
jurisdictions could result in arrest, loss of your driver's
license, and impoundment of your vehicle. Not to mention the
embarrassment in explaining your situation to family, friends, and
employers.

In case of any query regarding this article or other content needs
please contact: editorial@plusmediasolutions.com






"""

Below is my code:

    from collections import Counter

    import spacy
    from bs4 import BeautifulSoup

    CONTENT_XML_TAG = ('p', 'ul', 'h3', 'h1', 'h2', 'ol')
    soup = BeautifulSoup(mytxt, 'xml')
    spacy_model = spacy.load('en_core_web_sm')
    content = "\n".join([p.get_text() for p in soup.find('body.content').findAll(CONTENT_XML_TAG)])
    print(content)

    section_spacy = spacy_model(content)
    tokenized_sentences = []
    for sent in section_spacy.sents:
        tokenized_sentences.append(sent)
    for s in tokenized_sentences:
        labels = [(ent.text, ent.label_) for ent in s.ents]
        print(Counter(labels))

The printed output:

Counter({('\n', 'GPE'): 2, ('Department of Motor Vehicles', 'ORG'): 1})
Counter({('\n', 'GPE'): 1})
Counter({('\n', 'GPE'): 2, ('State', 'ORG'): 1})
Counter({('\n', 'GPE'): 3})
Counter({('\n', 'GPE'): 1})

I am surprised that SpaCy makes this kind of misclassification. Am I missing something?

Chandan Gupta · Accepted Answer

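Add a custom pipeline component after the NER step that drops entities consisting only of whitespace:

    from bs4 import BeautifulSoup
    import spacy

    CONTENT_XML_TAG = ('p', 'ul', 'h3', 'h1', 'h2', 'ol')
    soup = BeautifulSoup(mytxt, 'xml')  # mytxt is the XML string from the question
    spacy_model = spacy.load('en_core_web_sm')
    content = "\n".join([p.get_text() for p in soup.find('body.content').findAll(CONTENT_XML_TAG)])

    def remove_whitespace_entities(doc):
        # Keep only entities whose text is not purely whitespace (e.g. "\n").
        doc.ents = [e for e in doc.ents if not e.text.isspace()]
        return doc

    # spaCy v2 API: add_pipe accepts the function itself. In spaCy v3, register the
    # function with @Language.component and pass its registered name instead.
    spacy_model.add_pipe(remove_whitespace_entities, after='ner')
    doc = spacy_model(content)
    print(doc.ents)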
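
A different approach, not part of the answer above, is to deal with the newlines before the text reaches the model instead of filtering entities afterwards. spaCy keeps runs of extra whitespace as tokens of their own, so collapsing them up front leaves the NER nothing whitespace-only to tag. The following is a minimal sketch using only the standard library. `normalize_whitespace` is a hypothetical helper name, and the sample string is a shortened excerpt of the question's text.

```python
import re

def normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space."""
    return re.sub(r"\s+", " ", text).strip()

# Shortened excerpt of the question's input, with the newlines that were mis-tagged.
sample = (
    "Gale Group\n\n\n"
    "Montpelier: Department of Motor Vehicles, has issued the following\n"
    "news release:"
)

cleaned = normalize_whitespace(sample)
print(cleaned)
```

The cleaned string can then be passed to `spacy_model(...)` in place of `content`. Note that this also removes paragraph boundaries, which may change how sentences are segmented, so the post-processing component above is the safer choice when line structure matters.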
