Reputation: 585

Shortest duplicate in a list of strings

imagine I got a list of strings containing duplicates with different lengths:

liste = ['I am googling for the solution for an hour now',
         'I am googling for the solution for an hour now --Sent via mail--',
         'I am googling for the solution for an hour now --Sent via mail-- What are you doing?',
         'Hello I am good thanks >> How are you?',
         'Hello I am good thanks',
         'Hello I am good thanks >>']

Wanted Output:

liste = ['I am googling for the solution for an hour now', 'Hello I am good thanks']

As you can see the strings are pretty close to duplicates but aren't exact duplicates. So a approach like this doesn't work:

mylist = list(dict.fromkeys(liste))

Have you got any idea how to just keep the shortest duplicate? The duplicates are always consecutive.

EDIT:

The order of the input list shouldn't be disturbed.

Upvotes: 1

Answers (6)

user2390182

Reputation: 73470

You can do the following:

mylist = []
for s in sorted(liste):
    if not (mylist and s.startswith(mylist[-1])):
        mylist.append(s)

You can then recover the original order of occurence:

mylist[:] = filter(set(mylist).__contains__, liste)

Upvotes: 4

piertoni

Reputation: 2021

I will implement something not really optimized, just to explain.

Looking at the input data, I assume you want not only one shortest sentence, but the set of shortest that are not similar.

Here the proposed implementation

liste = ['I am googling for the solution for an hour now','I am googling for the solution for an hour now --Sent via mail--', 'I am googling for the solution for an hour now --Sent via mail-- What are you doing?', 'Hello I am good thanks >> How are you?','Hello I am good thanks', 'Hello I am good thanks >>']


# First you need some function to get the common part between two sentences
def common_part(sen1, sen2):
    min_length = min( len(sen1), len(sen2) )

    for i in range(min_length):
        if sen1[i] != sen2[i]:
            return sen1[:i]
    # if we exit for loop sentences are equal or
    # one is a substring of the other
    return sen1[:min_length]

# than you can implement some logic checking for common parts

# I create a set to contains elements that has the advantage
# to avoid repetition if you add the same element twice
small_sentences = set()

# than you compare each element with others
# this is not a good algorithm, Big O notation is O(2^n)

for f1 in liste:
    shortest = f1  # store the sentence and compare with all others
    for f2 in liste:
        common = common_part(f1, f2)
        # I define a minimum of 5 chars to be considered "similar"
        # Than if the common part is smaller we take it as good
        if len(common) > 5 and len(shortest) > len(common):
            shortest = common
    small_sentences.add(shortest)

print(small_sentences)

Upvotes: 0

jeandemeusy

Reputation: 251

Here is a fully working example. By sorting the list by string length, you can only get duplicate of a given string after the string in the list. This way the code runs faster.

My code returns the smallest substrings in the list. So if a string is contains inside another, it will eliminate the second string.

str_list = [
    "I am googling for the solution for an hour now",
    "I am googling for the solution for an hour now --Sent via mail--",
    "I am googling for the solution for an hour now --Sent via mail-- What are you doing?",
    "Hello I am good thanks >> How are you?",
    "Hello I am good thanks",
    "Hello I am good thanks >>",
]

wrong_strings = []
str_list.sort(key=len)

for idx, item1 in enumerate(str_list):
    for item2 in str_list[idx + 1 :]:
        if item1 in item2:
            wrong_strings.append(item2)

base_strings = [string for string in str_list if string not in wrong_strings]

print(f"{base_strings=}")

This also work when a string has no duplicate in the list.

Upvotes: 0

ramzeek

Reputation: 2315

Ok, so despite my suggestion in the comments to use regular expressions, I chose to try something that doesn't. Instead, I made a numpy array that tracks the similarity of strings and uses that to come up with similar strings. It's a bit klunky and the main algorithm in the nested for loop could probably be cleaned up a bit to optimize performance, but it seems to work.

I use a default when comparing an string to itself of 0.9 instead of 1 to make sure things didn't always default to themselves, but I didn't really explore if this is necessary.

import numpy as np

mylist = ['I am googling for the solution for an hour now',
          'I am googling for the solution for an hour now --Sent via mail--',
          'I am googling for the solution for an hour now --Sent via mail-- What are you doing?',
          'Hello I am good thanks >> How are you?',
          'Hello I am good thanks',
          'Hello I am good thanks >>']

N = len(mylist)

overlap = np.ones((N,N))

for i in range(N):
   for j in range(N):
      if i == j: overlap[i,j] = 0.9
      else:
         x = min(len(mylist[i]), len(mylist[j]))
         for k in range(x):
            if mylist[i][k] != mylist[j][k]: break
         overlap[i,j] = (k+1) / len(mylist[i])

newlist = []
for i in range(N):
   j = np.argmax(overlap[:,i])
   print(f"{mylist[i]} --> {mylist[j]}")
   newlist.append(mylist[j])
#I am googling for the solution for an hour now --> I am googling for the solution for an hour now
#I am googling for the solution for an hour now --Sent via mail-- --> I am googling for the solution for an hour now
#I am googling for the solution for an hour now --Sent via mail-- What are you doing? --> I am googling for the solution for an hour now
#Hello I am good thanks >> How are you? --> Hello I am good thanks
#Hello I am good thanks --> Hello I am good thanks
#Hello I am good thanks >> --> Hello I am good thanks

Then your new set is:

print(list(set(newlist)))
#['Hello I am good thanks', 'I am googling for the solution for an hour now']

Upvotes: 1

S.B

Reputation: 16496

I solved this by this sentence: If two strings start with two equal words, they are considered duplicates. You can change this number if you want.

list_ = ['I am googling for the solution for an hour now',
         'I am googling for the solution for an hour now --Sent via mail--',
         'I am googling for the solution for an hour now --Sent via mail-- What are you doing?',
         'Hello I am good thanks >> How are you?',
         'Hello I am good thanks',
         'Hello I am good thanks >>']

all_duplicates = {}

for string in list_:

    # like 'I am' or 'Hello I'
    first_two_words = ' '.join(string.split(maxsplit=2)[:2])

    if first_two_words in all_duplicates:

        v = all_duplicates[first_two_words]

        # check to see if this item of the list is considered duplicate
        if first_two_words in v:
            if len(string) < len(v):
                all_duplicates[first_two_words] = string
    else:
        all_duplicates[first_two_words] = string


for k, v in all_duplicates.items():
    print(f"{k} --> {v}")

output:

I am --> I am googling for the solution for an hour now
Hello I --> Hello I am good thanks

Upvotes: 0

Blumer

Reputation: 73

You can sort your list by length, then loop through each element to see if the others (longer strings) start with it.

mySortedList = list(set(liste))
mySortedList.sort(key=len)
i = 0
while i<len(mySortedList)-1 :
    j = i+1
    while j < len(mySortedList):
        if mySortedList[j].startswith(mySortedList[i]):
            mySortedList.pop(j)
        else:
            j+=1
    i+=1

Upvotes: 0

Shortest duplicate in a list of strings

Answers (6)

Related Questions