drhicks

Reputation: 33

Faster alternative to for loop in for loop

I have a dictionary with ~150,000 keys. There are no duplicate keys. Each key is 127 characters long, and any two keys differ at 1-11 positions (most differences occur towards the end of the key). The value for each key is a unique ID and a blank list []. For a given key I want to find all other keys that differ by exactly 1 character and then append their IDs to the given key's blank list. At the end I want each key and its value (an ID and a list of the IDs of all keys that differ by one character).

My code works, but the problem is that it is too slow. The double for loop is 150,000^2 ≈ 22.5 billion iterations. On my computer I can get through ~2 million iterations per minute (calling the match1 function each time), so it would take ~8 days to finish. The loop without the match1 call runs ~7 times faster, so it would finish in ~1 day.

I am wondering if anyone knows how I can improve the speed of this?

# example dictionary
dict1 = {'key1' : ['1', []], 'key2' : ['2', []], ... , 'key150000' : ['150000', []]}


def match1(s1, s2, dict1):
    s = 0
    for c1, c2 in zip(reversed(s1), reversed(s2)):
        if s < 2:
            if c1 != c2:
                s = s + 1
        else:
            break
    if s == 1:
        dict1[s1][1].append(dict1[s2][0])


for s1 in dict1:
    for s2 in dict1:
        match1(s1, s2, dict1)

Upvotes: 2

Views: 19436

Answers (4)

kmonsoor

Reputation: 8109

For the key-matching part, use Levenshtein matching for extremely fast comparison. Python-Levenshtein is a C-extension-based implementation. Use its hamming() function to determine just the number of differing characters.

Install it using the Git link:

pip install git+git://github.com/ztane/python-Levenshtein.git

Now, use it as below by plugging it into @tdelaney's answer:

import Levenshtein as lv

itemlist = list(_dict.items())

while itemlist:
    key1, value1 = itemlist.pop()
    for key2, value2 in itemlist:
        if lv.hamming(key1, key2) == 1:
            value1[1].append(key2)
            value2[1].append(key1)
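
For reference, hamming() just counts the positions at which two equal-length strings differ, so a distance of exactly 1 is precisely the condition the question asks for. A minimal illustration, assuming the same import as above:

import Levenshtein as lv

# Hamming distance counts differing positions in two equal-length strings
print(lv.hamming('key1', 'key2'))  # 1 -- only the last character differs
print(lv.hamming('key1', 'key1'))  # 0 -- identical strings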

Upvotes: 0

prem kumar

Reputation: 5877

Try this code:

# example dictionary
dict1 = {'key1' : ['1', []], 'key2' : ['2', []], ... , 'key150000' : ['150000', []]}


def match1(s1, s2, dict1):
    s = 0
    # reverse and zip computations are avoided
    index = 127 - 1
    while index >= 0 and s < 2:
        if s1[index] != s2[index]:
            s = s + 1
        index = index - 1

    if s == 1:
        # we modify both s1's and s2's lists instead of only s1's to improve performance
        dict1[s1][1].append(dict1[s2][0])
        dict1[s2][1].append(dict1[s1][0])


keys = list(dict1.keys())
# match1 is invoked (n-1)*n/2 times instead of n*n
for i in range(0, len(keys)):
    for j in range(i + 1, len(keys)):
        # calling match1(s1, s2, dict1) makes match1(s2, s1, dict1) unnecessary,
        # because match1 now updates both entries
        match1(keys[i], keys[j], dict1)

Optimizations:

  • reverse and zip computations are avoided
  • for every key, comparison is made only against keys that appear later in the keys list (see the sketch after this list)
  • match1() updates the lists of both s1 and s2 instead of only s1; this is valid because the comparison is symmetric, i.e. comparing s1 to s2 and s2 to s1 gives the same result
  • the keys list is stored in a variable and indexed directly, so Python does not build new temporary lists during the loop
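
As a side note, the same triangular iteration over unordered key pairs can also be written with itertools.combinations; this is only a sketch of an equivalent loop over the dict1 defined above, not part of the code shown in this answer:

import itertools

# visits each unordered pair of keys exactly once, i.e. (n-1)*n/2 pairs
for s1, s2 in itertools.combinations(dict1.keys(), 2):
    match1(s1, s2, dict1)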

Upvotes: 1

jme

Reputation: 20695

Currently you are checking each key against every other key for a total of O(n^2) comparisons. The insight is that we only need to check against a very small fraction of the other keys.

Suppose the alphabet over which the keys are defined has k distinct characters. For example, if your keys are simple ASCII strings consisting of a-z and 0-9, then k = 26 + 10 = 36.

Given any key, we can generate all possible keys which are one character away: there are 127 * (k - 1) such strings. Whereas before you were comparing each key to ~150,000 other keys, now we only need to compare against 127 * (k - 1), which is 4445 for the case where k = 36. This reduces the overall time complexity from O(n^2) to O(n * k), where k is a constant independent of n. This is where the real speedup is as you scale up n.

Here's some code to generate all possible neighbors of a key:

def generate_neighbors(key, alphabet):
    for i in range(len(key)):
        left, right = key[:i], key[i+1:]
        for char in alphabet:
            if char != key[i]:
                yield left + char + right

So, for example:

>>> set(generate_neighbors('ab', {'a', 'b', 'c', 'd'}))
{'aa', 'ac', 'ad', 'bb', 'cb', 'db'}

Now we compute the neighborhoods of each key:

def compute_neighborhoods(data, alphabet):
    keyset = set(data.keys())
    for key in data:
        possible_neighbors = set(generate_neighbors(key, alphabet))
        neighbors = possible_neighbors & keyset

        identifier = data[key][0]

        for neighbor in neighbors:
            data[neighbor][1].append(identifier)

Now an example. Suppose

data = {
 '0a': [4, []],
 '1f': [9, []],
 '27': [3, []],
 '32': [8, []],
 '3f': [6, []],
 '47': [1, []],
 '7c': [2, []],
 'a1': [0, []],
 'c8': [7, []],
 'e2': [5, []]
}

Then:

>>> alphabet = set('abcdef01234567890')
>>> compute_neighborhoods(data, alphabet)
>>> data
{'0a': [4, []],
 '1f': [9, [6]],
 '27': [3, [1]],
 '32': [8, [5, 6]],
 '3f': [6, [8, 9]],
 '47': [1, [3]],
 '7c': [2, []],
 'a1': [0, []],
 'c8': [7, []],
 'e2': [5, [8]]}

There are a few more optimizations that I haven't implemented here. First, you say that the keys mostly differ on their later characters, and that they differ at 11 positions, at most. This means that we can be smarter about computing the intersection possible_neighbors & keyset and in generating the neighborhood. First, we amend generate_neighbors to modify the trailing characters of the key first. Then, instead of generating the whole set of neighbors at once, we generate them one at a time and check for inclusion in the data dictionary. We keep track of how many we find, and if we find 11 we break.

The reason I have not implemented this in my answer is that I'm not certain that it will result in a significant speedup, and it might in fact be slower, since it means replacing an optimized Python builtin (set intersection) with a pure-Python loop.
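
For illustration only, a rough sketch of that lazy variant could look like the following; the function name compute_neighborhoods_lazy, the max_neighbors cutoff, and the early-exit bookkeeping are assumptions, and as noted above it is not obviously faster than the set-intersection version:

def compute_neighborhoods_lazy(data, alphabet, max_neighbors=11):
    # Hypothetical variant: generate candidate neighbors one at a time,
    # varying the trailing characters first, and stop scanning a key early
    # once max_neighbors matches have been found for it.
    for key in data:
        identifier = data[key][0]
        found = 0
        for i in reversed(range(len(key))):  # trailing positions first
            if found == max_neighbors:
                break
            left, right = key[:i], key[i+1:]
            for char in alphabet:
                if char != key[i]:
                    candidate = left + char + right
                    if candidate in data:
                        data[candidate][1].append(identifier)
                        found += 1
                        if found == max_neighbors:
                            break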

Upvotes: 3

tdelaney

Reputation: 77337

This is untested so may be little more than idle speculation, but... you can reduce the number of dictionary lookups and, much more importantly, eliminate half of the comparisons by building the dict's items into a list and comparing each item only against the items that remain after it in the list.

_dict = {'key1' : ['1', []], 'key2' : ['2', []], ... , 'key150000' : ['150000', []]}

# assuming python 3
itemlist = list(_dict.items())

while itemlist:
    key1, value1 = itemlist.pop()
    for key2, value2 in itemlist:
        # doesn't have early short circuit but may have fewer lookups to compensate
        if sum(c1 != c2 for c1, c2 in zip(key1, key2)) == 1:
            value1[1].append(key2)
            value2[1].append(key1)
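
The generator expression counts the positions at which the two keys differ, so a pair is accepted only when exactly one position mismatches. A tiny illustration of just that comparison, independent of the dictionary above:

# counts positions where the characters differ; here only the last one does
print(sum(c1 != c2 for c1, c2 in zip('key1', 'key2')) == 1)  # True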

Upvotes: 1
