Reputation: 1235
Let's say I have a list of strings, and some of them are very, very similar. I want to delete those near-duplicates. For that, I came up with the following code:
from difflib import SequenceMatcher
l = ['Apple', 'Appel', 'Aple', 'Mango']
c = [l[0]]
for i in l:
    count = 0
    for j in c:
        if SequenceMatcher(None, i, j).ratio() < 0.7:
            count += 1
    if count == len(c):
        c.append(i)
This seems to work fine, but I don't really like the nested loops, and the count solution looks ugly. Is it possible to write this in a more Pythonic way? Using generators, maybe?
Would be grateful for a hint, thanks :)
Upvotes: 1
Views: 436
Reputation: 22314
I think a cleaner way to write this would be to use the difflib function get_close_matches:
from difflib import get_close_matches
l = ['Apple', 'Appel', 'Aple', 'Mango']
c = []
while l:
    word = l.pop()
    c.append(word)
    l = [x for x in l if x not in get_close_matches(word, l, cutoff=0.7)]
Note that this empties l as it runs, so you may want to make a copy of it first.
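If you would rather keep the original order and leave the input list untouched, here is a minimal sketch that stays with SequenceMatcher from the question and swaps the manual counter for all() over a generator expression. The function name drop_near_duplicates and its threshold parameter are made up for illustration; they are not part of difflib.

```python
from difflib import SequenceMatcher


def drop_near_duplicates(words, threshold=0.7):
    """Keep a word only if it is not too similar to any word already kept.

    Hypothetical helper: the name and the threshold parameter are
    illustrative, not a difflib API.
    """
    kept = []
    for word in words:
        # all() over a generator replaces the manual counter: the word is
        # kept only when its ratio to every kept word is below the threshold.
        # For an empty kept list all() is True, so the first word is kept.
        if all(SequenceMatcher(None, word, k).ratio() < threshold for k in kept):
            kept.append(word)
    return kept


words = ['Apple', 'Appel', 'Aple', 'Mango']
result = drop_near_duplicates(words)
print(result)  # ['Apple', 'Mango']
```

One more thing to watch with get_close_matches: it returns at most n matches, and n defaults to 3, so on longer lists you would probably want to pass a larger n.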
Upvotes: 4