Reputation: 4041
I am using SpaCy to get named entities. However, it always mis-tags new line symbols as named entities.
Below is the input text.
mytxt = """<?xml version="1.0"?>
<nitf>
<head>
<title>KNOW YOUR ROLE ON SUPER BOWL LIII.</title>
</head>
<body>
<body.head>
<hedline>
<hl1>KNOW YOUR ROLE ON SUPER BOWL LIII.</hl1>
</hedline>
<distributor>Gale Group</distributor>
</body.head>
<body.content>
<p>Montpelier: <org>Department of Motor Vehicles</org>, has issued the following
news release:</p>
<p>Be a designated sober driver, help save lives. Remember these tips
on game night:</p>
<p>Know your State's laws: refusing to take a breath test in many
jurisdictions could result in arrest, loss of your driver's
license, and impoundment of your vehicle. Not to mention the
embarrassment in explaining your situation to family, friends, and
employers.</p>
<p>In case of any query regarding this article or other content needs
please contact: <a href="mailto:[email protected]">[email protected]</a></p>
</body.content>
</body>
</nitf>
"""
Below is my code:
CONTENT_XML_TAG = ('p', 'ul', 'h3', 'h1', 'h2', 'ol')
soup = BeautifulSoup(mytxt, 'xml')
spacy_model = spacy.load('en_core_web_sm')
content = "\n".join([p.get_text() for p in soup.find('body.content').findAll(CONTENT_XML_TAG)])
print(content)
section_spacy = spacy_model(content)
tokenized_sentences = []
for sent in section_spacy.sents:
tokenized_sentences.append(sent)
for s in tokenized_sentences:
labels = [(ent.text, ent.label_) for ent in s.ents]
print(Counter(labels))
The print out:
Counter({('\n', 'GPE'): 2, ('Department of Motor Vehicles', 'ORG'): 1})
Counter({('\n', 'GPE'): 1})
Counter({('\n', 'GPE'): 2, ('State', 'ORG'): 1})
Counter({('\n', 'GPE'): 3})
Counter({('\n', 'GPE'): 1})
I cannot believe SpaCy has such kind of misclassification. Did I miss anything?
Upvotes: 3
Views: 1407
Reputation: 722
from bs4 import BeautifulSoup
import spacy
CONTENT_XML_TAG = ('p', 'ul', 'h3', 'h1', 'h2', 'ol')
soup = BeautifulSoup(mytxt, 'xml')
spacy_model = spacy.load('en_core_web_sm')
content = "\n".join([p.get_text() for p in soup.find('body.content').findAll(CONTENT_XML_TAG)])
section_spacy = spacy_model(content)
def remove_whitespace_entities(doc):
doc.ents = [e for e in doc.ents if not e.text.isspace()]
return doc
spacy_model.add_pipe(remove_whitespace_entities, after='ner')
doc = spacy_model(content)
print(doc.ents)
Upvotes: 4