Linda Brck

Reputation: 73

How can I remove duplicate names from scraped data?

I scraped articles from websites, and some of them contain a name duplicated back to back, with no space after the second copy, like this:

"Bakhtiar Mohammed AbdullaBakhtiar Mohammed AbdullaA forensic analysis of Ms Borch’s computer revealed that she had watched videos of the beheadings carried out by Mohammed Emwazi or Jihadi John, the British Isil fighter."

How can I find those instances and delete the duplicate (but keep the name once)?

I tried NER with spaCy (en_core_web_sm). When run in PyCharm, the output was:

Mohammed AbdullaBakhtiar Mohammed AbdullaA PERSON
Ms Borch’s ORG
Mohammed Emwazi PERSON

When I ran the same script in a Jupyter Notebook, the output did not contain the name at all:

Ms Borch’s ORG
Mohammed Emwazi PERSON
Jihadi John PERSON
British NORP

My code snippet:

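import spacy

# Load the small English pipeline and run NER over the scraped text
NER = spacy.load("en_core_web_sm")
raw_text = "text above"  # placeholder for the scraped sentence quoted above
text1 = NER(raw_text)
for word in text1.ents:
    print(word.text, word.label_)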

Similarly, sometimes only part of a passage is duplicated, like so: "People greet refugees as they arrive at the main train station in Munich, GermanyPeople greet refugees as they arrive at the main train station in Munich". How do I get rid of the second part there and replace it with a space?

Upvotes: 1

Views: 87

Answers (1)

Tim Biegeleisen

Reputation: 520948

You could remove such duplicate names from the source text itself, before applying spaCy to it. Here is one way to do this using regular expressions.

import re

inp = "Bakhtiar Mohammed AbdullaBakhtiar Mohammed AbdullaA forensic analysis of Ms Borch’s computer revealed that she had watched videos of the beheadings carried out by Mohammed Emwazi or Jihadi John, the British Isil fighter"
# Keep the name once (group 1), then a space, then whatever word was glued onto the second copy (group 2)
output = re.sub(r'(\w+(?: \w+)+)\1(\w*)', r'\1 \2', inp)
print(output)

This prints:

Bakhtiar Mohammed Abdulla A forensic analysis of Ms Borch’s computer revealed that she had watched videos of the beheadings carried out by Mohammed Emwazi or Jihadi John, the British Isil fighter

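The regular expression used above matches a name of two or more space-separated words which is then immediately followed by the same name, with no separating space. The replacement keeps the name once and inserts a space between it and the word that was glued onto the second copy. Note that a single-word name repeated this way would not be matched.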
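
The second case in the question, where only part of the text is repeated (the caption ending in "Munich, Germany"), is not matched by the pattern above, because a tail such as ", Germany" sits between the two copies. Below is a minimal sketch of one way to extend the idea. It rests on assumptions not stated in the question: that the repeated phrase starts with a capital letter, that the second copy is glued directly onto the preceding word character, and that any tail between the copies is short. The names GLUED_REPEAT and collapse_glued_repeats and the 30-character limit are hypothetical choices for illustration.

```python
import re

# Assumptions (hypothetical, tune for your data): the repeated phrase starts
# with a capital letter, has at least two words, and its second copy is glued
# to the preceding word character with no space in between.
GLUED_REPEAT = re.compile(
    r"\b([A-Z]\w*(?: \w+)+)"  # group 1: phrase of two or more space-separated words
    r"([^.!?]{0,30}?)"         # group 2: optional short tail, e.g. ", Germany"
    r"(?<=\w)\1"               # the same phrase again, directly after a word character
)


def collapse_glued_repeats(text: str) -> str:
    """Keep the first copy (and any tail) and put a space where the second copy was."""
    return GLUED_REPEAT.sub(r"\1\2 ", text)


name_case = "Bakhtiar Mohammed AbdullaBakhtiar Mohammed AbdullaA forensic analysis"
print(collapse_glued_repeats(name_case))
# Bakhtiar Mohammed Abdulla A forensic analysis

caption_case = (
    "People greet refugees as they arrive at the main train station in Munich, Germany"
    "People greet refugees as they arrive at the main train station in Munich"
)
# rstrip() drops the space left behind when the repeat ends the string
print(collapse_glued_repeats(caption_case).rstrip())
# People greet refugees as they arrive at the main train station in Munich, Germany
```

Requiring the second copy to follow a word character directly is what keeps ordinary repetition such as "The cat saw The cat" untouched. The trade-off is that repeats which do not start with a capital letter, or whose tail is longer than the limit, are left as they are.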

Upvotes: 1
