Reputation: 73
I scraped articles from websites, and some of them contain duplicated names with no space after the duplicate, like this:
"Bakhtiar Mohammed AbdullaBakhtiar Mohammed AbdullaA forensic analysis of Ms Borch’s computer revealed that she had watched videos of the beheadings carried out by Mohammed Emwazi or Jihadi John, the British Isil fighter."
How can I find those instances and delete the duplicate (but keep the name once)?
I tried NER with spacy (en_core_web_sm). PyCharm's output was:
Mohammed AbdullaBakhtiar Mohammed AbdullaA PERSON
Ms Borch’s ORG
Mohammed Emwazi PERSON
When I run the same script in a Jupyter Notebook, the output does not contain the name at all:
Ms Borch’s ORG
Mohammed Emwazi PERSON
Jihadi John PERSON
British NORP
My code snippet:
import spacy
NER = spacy.load("en_core_web_sm")
raw_text="text above"
text1 = NER(raw_text)
for word in text1.ents:
    print(word.text, word.label_)
Similarly, sometimes part of something is duplicated, like so: "People greet refugees as they arrive at the main train station in Munich, GermanyPeople greet refugees as they arrive at the main train station in Munich" How do I get rid of the second part there and replace it with a space?
Upvotes: 1
Views: 87
Reputation: 520948
We could try removing such duplicate names from the source text itself, before you apply spaCy to it. Here is one way to do this using regular expressions.
import re
inp = "Bakhtiar Mohammed AbdullaBakhtiar Mohammed AbdullaA forensic analysis of Ms Borch’s computer revealed that she had watched videos of the beheadings carried out by Mohammed Emwazi or Jihadi John, the British Isil fighter"
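# Group 1 is a multi-word phrase that is immediately followed by a copy of itself (\1);
# group 2 captures any word characters fused onto the end of that copy.
output = re.sub(r'(\w+(?: \w+)+)\1(\w*)', lambda m: m.group(1) + (" " + m.group(2) if m.group(2) else ""), inp)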
print(output)
This prints:
Bakhtiar Mohammed Abdulla A forensic analysis of Ms Borch’s computer revealed that she had watched videos of the beheadings carried out by Mohammed Emwazi or Jihadi John, the British Isil fighter
The regular expression matches a name of two or more words (group 1) that is immediately followed by the same text, with no separating space, plus any word characters run onto the end of the copy (group 2). The replacement keeps the name once and puts a space between it and those trailing characters.
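The second case in the question (a caption followed by a partial copy of itself) is not caught by the pattern above, because the repeated text is only the beginning of the first copy and is not directly adjacent to it. Here is a minimal sketch for that case, assuming the repeated part runs to the end of the string. The names `strip_repeated_prefix` and `min_len` are hypothetical helpers introduced here for illustration, not part of any library.

```python
def strip_repeated_prefix(text: str, min_len: int = 20) -> str:
    """Drop a trailing copy of the text's own beginning, if there is one.

    Assumes the duplicated part runs to the end of `text` and is at
    least `min_len` characters long (to avoid cutting on chance repeats).
    """
    n = len(text)
    # The tail text[i:] can only be a copy of the beginning if it is no
    # longer than the head text[:i], so start at the midpoint. Scanning
    # upward from there tries the longest candidate tail first.
    for i in range((n + 1) // 2, n - min_len + 1):
        if text.startswith(text[i:]):
            return text[:i]
    return text  # no repeated prefix found


caption = ("People greet refugees as they arrive at the main train station "
           "in Munich, GermanyPeople greet refugees as they arrive at the "
           "main train station in Munich")
cleaned = strip_repeated_prefix(caption)
print(cleaned)
```

This returns the caption once, ending in "Munich, Germany". If more text follows the caption in your data, you would need to split it off first and add the separating space when joining the pieces back together.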
Upvotes: 1