lzc
lzc

Reputation: 1705

Letter combinations between a string in a list

I'm trying to compare the difference of a given string to a list. Precisely I'm trying to compare a given word, if only one letter of the word was different, to my list of words.

list = ['fake','bake','sake','rake'] #probably a set

If given word was take then the result would return fake bake sake rake

If the word was bare then the return is bake

The way I'm planning to do this is to split the given word into and start a loop to interchange every letter of this word with a list of the dictionary (a,b,c's). With every iteration of my loop, I plan to check if this word is in my list of words.

I calculated for just a 4 letter word, I would have to do about 26^4 loops in order to check every letter combination to match my list of words.


Can someone show me an efficient way to check combinations of a word?

Upvotes: 1

Views: 409

Answers (5)

wwii
wwii

Reputation: 23753

I like slices myself. Use a function that returns True/False to filter the list for the conditions you need/want.

orig = 'abcdef#ghijklmn'
test = 'abcdef%ghijklmn'
test_bad = 'abcdef%ghijk*mn'

def one_letter_different(s1, s2):
    """returns True if there is only one letter different between s1 and s2.

    Sequentially check each letter of each string till they don't match
    then check to see if the rest of the strings are equal.

    s1, s2 -> str
    """
    for i, c in enumerate(s1):
        if c != s2[i]:
            # test for substituition, deletion and insertion
            return (s1[i + 1:] == s2[i + 1:] or
                    s1[i:] == s2[i + 1:] or
                    s1[i+1:] == s2[i:])
    # s1 equals s2
    return False

print one_letter_different(orig, test)
print one_letter_different(orig, test_bad)

test = 'take'
print [item for item in ['fake','bake','sake','rake']
       if one_letter_different(item, test)]

test = 'bare'
print [item for item in ['fake','bake','sake','rake']
       if one_letter_different(item, test)]

Produces:

>>> 
True
False
['fake', 'bake', 'sake', 'rake']
['bake']
>>> 

The compare function could also be defined as:

from operator import ne
from itertools import izip_longest

def one_letter_different(s1, s2):
    """returns True if there is less than two letters different.

    Sequentially compare the letters of each string and sum the differences.

    s1, s2 -> str
    """
    return sum(ne(*thing) for thing in izip_longest(s1, s2, fillvalue = None)) == 1

Upvotes: 0

Igonato
Igonato

Reputation: 10807

Here is simple expression which returns number of different letters or False if strings have different length:

len(s1) == len(s2) and sum(1 for a, b in zip(s1, s2) if a != b)

And in your case:

target = 'take'
list = ['fake','bake','sake','rake']

def diff(s1, s2): 
    return len(s1) == len(s2) and sum(1 for a, b in zip(s1, s2) if a != b)

print [word for word in list if diff(word, target) == 1]

Upvotes: 0

Stefano Sanfilippo
Stefano Sanfilippo

Reputation: 33056

Try testing the word against each of the base words, letter by letter. Increase a counter on each difference found and keep track of the words with 0 or 1 differences. This is linear in the number of base words, much better than your exponential approach.

Here is a reference implementation:

def atMostOneDifference(word):
    matching = []
    for baseWord in ['fake','bake','sake','rake']:
        distance = 0
        if len(word) != len(baseWord):
            continue
        # We take the i-th letter from word and baseWord...
        for letter, baseLetter in zip(word, baseWord):
            if letter != baseLetter:
                distance += 1
        if distance <= 1:
            matching.append(baseWord)
    return matching

Upvotes: 0

beroe
beroe

Reputation: 12316

The jellyfish library can calculate a whole host of distances between words. It will probably be better to use this wheel rather than inventing one of your own.

From the example page:

>>> import jellyfish
>>> jellyfish.levenshtein_distance('jellyfish', 'smellyfish')
2
>>> jellyfish.jaro_distance('jellyfish', 'smellyfish')
0.89629629629629637
>>> jellyfish.damerau_levenshtein_distance('jellyfish', 'jellyfihs')
1

So applied to your question:

import jellyfish
target = 'take'
list = ['teak','fake','bake','sake','rake','sale']
outlist = [x for x in list if jellyfish.levenshtein_distance(x,target) == 1]

print outlist
['fake', 'bake', 'sake', 'rake']

Upvotes: 1

Peter DeGlopper
Peter DeGlopper

Reputation: 37344

word = 'take'
matches = []
candidate_list = ['fake','bake','sake','rake']
for candidate in candidate_list:
    differences = 0
    for (original_word_letter, candidate_word_letter) in izip(word, candidate):
        if original_word_letter != candidate_word_letter:
            differences += 1
        if differences > 1:
            break
    else:
        matches.append(candidate)

This uses the relatively obscure else clause on the for loop, which is not executed if the loop exited due to a break, and assumes that the word lengths are all equal - testing for non-equal lengths is of course straightforward.

It's a bad idea to use built-in names like list for your own variables - they're not informative and they will hide the built-in meaning within the appropriate scope.

Upvotes: 0

Related Questions