Reputation: 1705
I'm trying to compare the difference of a given string to a list. Precisely I'm trying to compare a given word, if only one letter of the word was different, to my list of words.
list = ['fake','bake','sake','rake'] #probably a set
If given word was take
then the result would return fake bake sake rake
If the word was bare
then the return is bake
The way I'm planning to do this is to split the given word into and start a loop to interchange every letter of this word with a list of the dictionary (a,b,c's). With every iteration of my loop, I plan to check if this word is in my list of words.
I calculated for just a 4 letter word, I would have to do about 26^4 loops in order to check every letter combination to match my list of words.
Can someone show me an efficient way to check combinations of a word?
Upvotes: 1
Views: 409
Reputation: 23753
I like slices myself. Use a function that returns True/False to filter the list for the conditions you need/want.
orig = 'abcdef#ghijklmn'
test = 'abcdef%ghijklmn'
test_bad = 'abcdef%ghijk*mn'
def one_letter_different(s1, s2):
"""returns True if there is only one letter different between s1 and s2.
Sequentially check each letter of each string till they don't match
then check to see if the rest of the strings are equal.
s1, s2 -> str
"""
for i, c in enumerate(s1):
if c != s2[i]:
# test for substituition, deletion and insertion
return (s1[i + 1:] == s2[i + 1:] or
s1[i:] == s2[i + 1:] or
s1[i+1:] == s2[i:])
# s1 equals s2
return False
print one_letter_different(orig, test)
print one_letter_different(orig, test_bad)
test = 'take'
print [item for item in ['fake','bake','sake','rake']
if one_letter_different(item, test)]
test = 'bare'
print [item for item in ['fake','bake','sake','rake']
if one_letter_different(item, test)]
Produces:
>>>
True
False
['fake', 'bake', 'sake', 'rake']
['bake']
>>>
The compare function could also be defined as:
from operator import ne
from itertools import izip_longest
def one_letter_different(s1, s2):
"""returns True if there is less than two letters different.
Sequentially compare the letters of each string and sum the differences.
s1, s2 -> str
"""
return sum(ne(*thing) for thing in izip_longest(s1, s2, fillvalue = None)) == 1
Upvotes: 0
Reputation: 10807
Here is simple expression which returns number of different letters or False
if strings have different length:
len(s1) == len(s2) and sum(1 for a, b in zip(s1, s2) if a != b)
And in your case:
target = 'take'
list = ['fake','bake','sake','rake']
def diff(s1, s2):
return len(s1) == len(s2) and sum(1 for a, b in zip(s1, s2) if a != b)
print [word for word in list if diff(word, target) == 1]
Upvotes: 0
Reputation: 33056
Try testing the word against each of the base words, letter by letter. Increase a counter on each difference found and keep track of the words with 0 or 1 differences. This is linear in the number of base words, much better than your exponential approach.
Here is a reference implementation:
def atMostOneDifference(word):
matching = []
for baseWord in ['fake','bake','sake','rake']:
distance = 0
if len(word) != len(baseWord):
continue
# We take the i-th letter from word and baseWord...
for letter, baseLetter in zip(word, baseWord):
if letter != baseLetter:
distance += 1
if distance <= 1:
matching.append(baseWord)
return matching
Upvotes: 0
Reputation: 12316
The jellyfish library can calculate a whole host of distances between words. It will probably be better to use this wheel rather than inventing one of your own.
From the example page:
>>> import jellyfish
>>> jellyfish.levenshtein_distance('jellyfish', 'smellyfish')
2
>>> jellyfish.jaro_distance('jellyfish', 'smellyfish')
0.89629629629629637
>>> jellyfish.damerau_levenshtein_distance('jellyfish', 'jellyfihs')
1
So applied to your question:
import jellyfish
target = 'take'
list = ['teak','fake','bake','sake','rake','sale']
outlist = [x for x in list if jellyfish.levenshtein_distance(x,target) == 1]
print outlist
['fake', 'bake', 'sake', 'rake']
Upvotes: 1
Reputation: 37344
word = 'take'
matches = []
candidate_list = ['fake','bake','sake','rake']
for candidate in candidate_list:
differences = 0
for (original_word_letter, candidate_word_letter) in izip(word, candidate):
if original_word_letter != candidate_word_letter:
differences += 1
if differences > 1:
break
else:
matches.append(candidate)
This uses the relatively obscure else
clause on the for loop, which is not executed if the loop exited due to a break
, and assumes that the word lengths are all equal - testing for non-equal lengths is of course straightforward.
It's a bad idea to use built-in names like list
for your own variables - they're not informative and they will hide the built-in meaning within the appropriate scope.
Upvotes: 0