Jan Pesl

Reputation: 29

Duplicate strings in a list not removed unless the most similar ones are in a sublist

A list of text strings (full texts of newspaper articles) cannot be fully deduplicated. The only workaround I have found is to identify the most common sentences, select the list items containing one of these sentences, and then deduplicate at the level of these sublists.

After reading through the many similar questions here, I still have no solution.

Here are four different methods that I have tried:

1] x = list(dict.fromkeys(lst))
2] x = set(lst)
3] from iteration_utilities import unique_everseen
   x = list(unique_everseen(lst))
4] using pandas
   df = df.drop_duplicates(subset=['article_body'], keep='first')

All of these return the same number of list items.

However, when I check the frequency distribution of the most common sentences and search for one of them, I still find around 45 hits, because this sentence appears in several texts, some of which are identical. When these matching texts are all lumped into one list, I can then use x = list(dict.fromkeys(lst)) on it, which leaves only 9 list items.

How is this possible?

df = pd.read_json('UK data/2010-11.json')
len(df)
13288

df = df.drop_duplicates(subset=['article_body'], keep='first')
len(df)
6118

lst = df['article_body'].tolist()
len(lst)
6118

# taking this approach as the reference point: applied to the whole list,
# it also returns 6118, i.e. it finds no further duplicates

len(list(dict.fromkeys(lst)))
6118
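
# One quick sanity check (a sketch over the same lst, standard library only):
# collapse runs of whitespace before counting uniques, to see whether items
# that look identical differ only in invisible characters.

import re

normalized = {re.sub(r'\s+', ' ', text).strip() for text in lst}
len(normalized)

# if this is smaller than 6118, the surviving "duplicates" are not
# byte-for-byte equal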

from nltk.tokenize import sent_tokenize

searchStr = 'Lines close at midnight.'
found = []

for text in lst:
    sentences = sent_tokenize(text)
    for sentence in sentences:
        if sentence == searchStr:
            # note: this appends the text once per matching sentence,
            # so a text that repeats the sentence is added several times
            found.append(text)

len(found)
45

# when the same dict.fromkeys approach is used on this subset of the
# full texts, it suddenly finds duplicates

len(list(dict.fromkeys(found)))
9
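
# As a follow-up check (a sketch reusing the found list from above), I can
# count how many distinct texts are behind the 45 hits and how often each
# one was appended:

from collections import Counter

appends_per_text = Counter(found)
print(len(appends_per_text))                             # distinct texts among the 45 hits
print(sorted(appends_per_text.values(), reverse=True))   # counts > 1 mean the sentence repeats within a text

# if any count is greater than 1, part of the gap between 45 and 9 comes from
# the loop appending the same text once per occurrence of the sentence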

EDIT: Please check the full demonstration in jupyter notebook available here: https://colab.research.google.com/drive/1EF6PL8aduZIO--Ok0hGMzLWFIquz6F_L

I would expect that using the very same function on the full list would remove ALL duplicates, but this is clearly not the case. Why can't I remove the duplicates from the whole list? How can I make sure that each list item is compared with all the others?

Upvotes: 0

Views: 97

Answers (1)

Steven D Riggs

Reputation: 11

It sounds like whitespace may be the issue.

import re

# collapse every run of whitespace into a single space, then deduplicate
x = list(set(map(lambda string: re.sub(r'\s+', ' ', string), lst)))

or something like it may work.
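
For example (made-up strings, not from the question's data), two texts that differ only in whitespace collapse to one entry after normalizing:

import re

a = 'Lines close at midnight. Calls cost 50p.'
b = 'Lines close at midnight.\n\nCalls cost 50p.'  # same text, different whitespace

norm = lambda s: re.sub(r'\s+', ' ', s)
print(len({a, b}))              # 2: exact comparison keeps both
print(len({norm(a), norm(b)}))  # 1: they collapse after normalization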

Upvotes: 1
