Reputation: 585
imagine I got a list of strings containing duplicates with different lengths:
liste = ['I am googling for the solution for an hour now',
'I am googling for the solution for an hour now --Sent via mail--',
'I am googling for the solution for an hour now --Sent via mail-- What are you doing?',
'Hello I am good thanks >> How are you?',
'Hello I am good thanks',
'Hello I am good thanks >>']
Wanted Output:
liste = ['I am googling for the solution for an hour now', 'Hello I am good thanks']
As you can see the strings are pretty close to duplicates but aren't exact duplicates. So a approach like this doesn't work:
mylist = list(dict.fromkeys(liste))
Have you got any idea how to just keep the shortest duplicate? The duplicates are always consecutive.
EDIT:
The order of the input list shouldn't be disturbed.
Upvotes: 1
Views: 251
Reputation: 73470
You can do the following:
mylist = []
for s in sorted(liste):
if not (mylist and s.startswith(mylist[-1])):
mylist.append(s)
You can then recover the original order of occurence:
mylist[:] = filter(set(mylist).__contains__, liste)
Upvotes: 4
Reputation: 2021
I will implement something not really optimized, just to explain.
Looking at the input data, I assume you want not only one shortest sentence, but the set of shortest that are not similar.
Here the proposed implementation
liste = ['I am googling for the solution for an hour now','I am googling for the solution for an hour now --Sent via mail--', 'I am googling for the solution for an hour now --Sent via mail-- What are you doing?', 'Hello I am good thanks >> How are you?','Hello I am good thanks', 'Hello I am good thanks >>']
# First you need some function to get the common part between two sentences
def common_part(sen1, sen2):
min_length = min( len(sen1), len(sen2) )
for i in range(min_length):
if sen1[i] != sen2[i]:
return sen1[:i]
# if we exit for loop sentences are equal or
# one is a substring of the other
return sen1[:min_length]
# than you can implement some logic checking for common parts
# I create a set to contains elements that has the advantage
# to avoid repetition if you add the same element twice
small_sentences = set()
# than you compare each element with others
# this is not a good algorithm, Big O notation is O(2^n)
for f1 in liste:
shortest = f1 # store the sentence and compare with all others
for f2 in liste:
common = common_part(f1, f2)
# I define a minimum of 5 chars to be considered "similar"
# Than if the common part is smaller we take it as good
if len(common) > 5 and len(shortest) > len(common):
shortest = common
small_sentences.add(shortest)
print(small_sentences)
Upvotes: 0
Reputation: 251
Here is a fully working example. By sorting the list by string length, you can only get duplicate of a given string after the string in the list. This way the code runs faster.
My code returns the smallest substrings in the list. So if a string is contains inside another, it will eliminate the second string.
str_list = [
"I am googling for the solution for an hour now",
"I am googling for the solution for an hour now --Sent via mail--",
"I am googling for the solution for an hour now --Sent via mail-- What are you doing?",
"Hello I am good thanks >> How are you?",
"Hello I am good thanks",
"Hello I am good thanks >>",
]
wrong_strings = []
str_list.sort(key=len)
for idx, item1 in enumerate(str_list):
for item2 in str_list[idx + 1 :]:
if item1 in item2:
wrong_strings.append(item2)
base_strings = [string for string in str_list if string not in wrong_strings]
print(f"{base_strings=}")
This also work when a string has no duplicate in the list.
Upvotes: 0
Reputation: 2315
Ok, so despite my suggestion in the comments to use regular expressions, I chose to try something that doesn't. Instead, I made a numpy array that tracks the similarity of strings and uses that to come up with similar strings. It's a bit klunky and the main algorithm in the nested for loop could probably be cleaned up a bit to optimize performance, but it seems to work.
I use a default when comparing an string to itself of 0.9
instead of 1 to make sure things didn't always default to themselves, but I didn't really explore if this is necessary.
import numpy as np
mylist = ['I am googling for the solution for an hour now',
'I am googling for the solution for an hour now --Sent via mail--',
'I am googling for the solution for an hour now --Sent via mail-- What are you doing?',
'Hello I am good thanks >> How are you?',
'Hello I am good thanks',
'Hello I am good thanks >>']
N = len(mylist)
overlap = np.ones((N,N))
for i in range(N):
for j in range(N):
if i == j: overlap[i,j] = 0.9
else:
x = min(len(mylist[i]), len(mylist[j]))
for k in range(x):
if mylist[i][k] != mylist[j][k]: break
overlap[i,j] = (k+1) / len(mylist[i])
newlist = []
for i in range(N):
j = np.argmax(overlap[:,i])
print(f"{mylist[i]} --> {mylist[j]}")
newlist.append(mylist[j])
#I am googling for the solution for an hour now --> I am googling for the solution for an hour now
#I am googling for the solution for an hour now --Sent via mail-- --> I am googling for the solution for an hour now
#I am googling for the solution for an hour now --Sent via mail-- What are you doing? --> I am googling for the solution for an hour now
#Hello I am good thanks >> How are you? --> Hello I am good thanks
#Hello I am good thanks --> Hello I am good thanks
#Hello I am good thanks >> --> Hello I am good thanks
Then your new set is:
print(list(set(newlist)))
#['Hello I am good thanks', 'I am googling for the solution for an hour now']
Upvotes: 1
Reputation: 16496
I solved this by this sentence: If two strings start with two equal words, they are considered duplicates. You can change this number if you want.
list_ = ['I am googling for the solution for an hour now',
'I am googling for the solution for an hour now --Sent via mail--',
'I am googling for the solution for an hour now --Sent via mail-- What are you doing?',
'Hello I am good thanks >> How are you?',
'Hello I am good thanks',
'Hello I am good thanks >>']
all_duplicates = {}
for string in list_:
# like 'I am' or 'Hello I'
first_two_words = ' '.join(string.split(maxsplit=2)[:2])
if first_two_words in all_duplicates:
v = all_duplicates[first_two_words]
# check to see if this item of the list is considered duplicate
if first_two_words in v:
if len(string) < len(v):
all_duplicates[first_two_words] = string
else:
all_duplicates[first_two_words] = string
for k, v in all_duplicates.items():
print(f"{k} --> {v}")
output:
I am --> I am googling for the solution for an hour now
Hello I --> Hello I am good thanks
Upvotes: 0
Reputation: 73
You can sort your list by length, then loop through each element to see if the others (longer strings) start with it.
mySortedList = list(set(liste))
mySortedList.sort(key=len)
i = 0
while i<len(mySortedList)-1 :
j = i+1
while j < len(mySortedList):
if mySortedList[j].startswith(mySortedList[i]):
mySortedList.pop(j)
else:
j+=1
i+=1
Upvotes: 0