Reputation: 29
I cannot successfully deduplicate a list of text strings (full texts of newspaper articles). The only workaround I have found is to find the most common sentences, select the list items containing these sentences, and then deduplicate at the level of these sublists.
After reading through the myriad of similar questions here, I still have no solution.
Here are four different methods that I have tried:
1] x = list(dict.fromkeys(lst))
2] x = set(lst)
3] from iteration_utilities import unique_everseen
x = list(unique_everseen(lst))
4] using pandas
df = df.drop_duplicates(subset=['article_body'], keep='first')
All of these return the same number of list items.
However, when I check the frequency distribution of the most common 'sentences' and search for one of them, I still find around 45 hits, because this sentence appears in several texts, some of them identical. When these texts are all lumped into one list, I can then use x = list(dict.fromkeys(lst)) on it. This results in only 9 list items.
How is this possible?
import pandas as pd
df = pd.read_json('UK data/2010-11.json')
len(df)
13288
df = df.drop_duplicates(subset=['article_body'], keep='first')
len(df)
6118
lst = df['article_body'].tolist()
len(lst)
6118
# taking this solution as a reference point, here it returns 6118 at the level
# of the whole list
len(list(dict.fromkeys(lst)))
6118
from nltk.tokenize import sent_tokenize
searchStr = 'Lines close at midnight.'
found = []
for text in lst:
    sentences = sent_tokenize(text)
    for sentence in sentences:
        if sentence == searchStr:
            found.append(text)
len(found)
45
# when the function is used only on a subset of the full-texts, it can suddenly
# identify more duplicates
len(list(dict.fromkeys(found)))
9
EDIT: Please check the full demonstration in the Jupyter notebook available here: https://colab.research.google.com/drive/1EF6PL8aduZIO--Ok0hGMzLWFIquz6F_L
I would expect that using the very same function on the full list would remove ALL duplicates, but this is clearly not the case. Why can't I remove the duplicates from the whole list? How can I ensure that each list item is compared with all the others?
Upvotes: 0
Views: 97
Reputation: 11
It sounds like whitespace may be the issue.
import re
x = list(set(map(lambda string: re.sub(r'\s+', ' ', string), lst)))
or something like it may work.
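If whitespace differences do turn out to be the cause, a slightly fuller sketch of the same idea might look like the following. It keeps the original order and returns the first original text for each normalized form, whereas set() loses the order. The helper names normalize_whitespace and dedupe_by_normalized_text and the sample strings are made up for illustration; they are not from your notebook.

```python
import re

def normalize_whitespace(text):
    """Collapse every run of whitespace to one space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()

def dedupe_by_normalized_text(texts):
    """Keep the first original text for each whitespace-normalized form."""
    first_seen = {}
    for text in texts:
        # setdefault stores the text only if its normalized form is new
        first_seen.setdefault(normalize_whitespace(text), text)
    return list(first_seen.values())

# Hypothetical sample: the first two items differ only in whitespace
sample = [
    "Lines close at midnight.\nGood luck!",
    "Lines close at  midnight. Good luck! ",
    "A different article.",
]
unique = dedupe_by_normalized_text(sample)
print(len(unique))
```

On the real data this would be called as dedupe_by_normalized_text(lst). If the length stays at 6118, whitespace is probably not the explanation.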
Upvotes: 1