A S

Reputation: 1235

Python: deleting similar objects from a list using difflib.SequenceMatcher

Let's say I have a list of strings, and some of them are very, very similar. I want to delete those near-duplicates. For that, I came up with the following code:

from difflib import SequenceMatcher

l = ['Apple', 'Appel', 'Aple', 'Mango']
c = [l[0]]

for i in l:
    count = 0
    for j in c:
        if SequenceMatcher(None, i, j).ratio() < 0.7:
            count += 1
    if count == len(c):
        c.append(i)

This seems to work fine, but I don't really like the nested loops, and the count solution looks ugly. Is it possible to write this in a more Pythonic way? Using generators, maybe?

Would be grateful for a hint, thanks :)

Upvotes: 1

Views: 436

Answers (1)

Olivier Melançon

Reputation: 22314

I think a cleaner way to write this would be to use the difflib function get_close_matches:

from difflib import get_close_matches

l = ['Apple', 'Appel', 'Aple', 'Mango']
c = []

while l:
    word = l.pop()
    c.append(word)
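    # get_close_matches returns at most n=3 matches by default, so raise n to cover the whole list
    l = [x for x in l if x not in get_close_matches(word, l, n=len(l), cutoff=0.7)]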

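Note that this empties l, so you may want to make a copy of it first. Also, because pop() takes items from the end, the kept words are collected starting from the end of the list.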
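
If you would rather keep the original list intact and keep the first word of each group, here is a minimal sketch that sticks with SequenceMatcher and replaces the counter with all() over a generator expression. The names drop_near_duplicates and threshold are made up for illustration, and the 0.7 value is simply carried over from the question.

```python
from difflib import SequenceMatcher


def drop_near_duplicates(words, threshold=0.7):
    """Return a new list without near-duplicates, keeping the first of each group.

    drop_near_duplicates and threshold are illustrative names, not part of difflib.
    """
    kept = []
    for word in words:
        # Keep the word only if it is dissimilar to every word kept so far.
        # all() stops at the first similar word it finds.
        if all(SequenceMatcher(None, word, k).ratio() < threshold for k in kept):
            kept.append(word)
    return kept


words = ['Apple', 'Appel', 'Aple', 'Mango']
print(drop_near_duplicates(words))  # ['Apple', 'Mango']
print(words)                        # the input list is left unchanged
```

This is still quadratic in the worst case, since every word is compared against everything kept so far, but it avoids mutating the input and keeps the result in the original order.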

Upvotes: 4
