Ian

Reputation: 41

Efficient way of searching a list of lists according to their hash value?

I have a list of tuples with 3 members in each tuple as seen below:

[(-5092793511388848640, 'test1', 1),
 (-5092793511388848639, 'test0', 0), 
 (-5092793511388848638, 'test3', 3), 
 (-5092793511388848637, 'test2', 2), 
 (-5092793511388848636, 'test5', 5)]

The tuples are ordered in ascending order according to the first element of each tuple - the hash value of each key (e.g. 'test0'). I want to find a quick way of searching through these tuples using binary search of their hash values to find a specific key. The problem is that the quickest way I have found is using a for loop:

def get(key, D, hasher=hash):
    '''
    Returns the value in the dictionary corresponding to the given key.

    Arguments:
    key -- desired key to retrieve the value of.
    D -- intended dictionary to retrieve value from.
    hasher -- the hash function to be used on the key.
    '''
    for item in D:
        if item[0] == hasher(key):
            return item[2]
    raise TypeError('Key not found in the dictionary.')

The function I have written above seems to be very slow at searching through a much longer list of tuples, let's say a list of 6000 different tuples. It also breaks if there are any hash collisions. I was wondering if there was a more efficient/quicker way of searching the list for the correct tuple?

Side note: I know using dictionaries will be a much quicker and easier way to solve my problem but I'd like to avoid using them.

Upvotes: 2

Views: 1570

Answers (3)

Padraic Cunningham

Reputation: 180411

You could modify bisect to just check the first element:

def bisect_left(a, x, lo=0, hi=None):
    if lo < 0:
        raise ValueError('lo must be non-negative')
    if hi is None:
        hi = len(a)
    while lo < hi:
        mid = (lo+hi) // 2
        if a[mid][0] < x:
            lo = mid+1
        else: hi = mid
    return lo

def get_bis(key, d):
    h = hash(key)
    ind = bisect_left(d, h)
    if ind == len(d):
        raise KeyError(key)
    for i in xrange(ind, len(d)):
        if d[i][0] != h:
            raise KeyError()
        if d[i][1] == key:
            return d[i][2]
    raise KeyError()

Replicating some collisions, it does what it should:

In [41]: l = [(-5092793511388848640, 'test1', 1), (-5092793511388848639, 'test9', 0), (-5092793511388848639, 'test0', 3), (-5092793511388848637, 'test2', 2), (-5092793511388848636, 'test5', 5)]

In [42]: get("test0", l)
Out[42]: 3

In [43]: get("test1", l)
Out[43]: 1

In [44]: get(-5092793511388848639, l)
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-44-81e928da1ac8> in <module>()
----> 1 get(-5092793511388848639, l)

<ipython-input-30-499e71432196> in get(key, d)
      6     for sub in islice(d, ind, None):
      7         if sub[0] != h:
----> 8             raise KeyError()
      9         if sub[1] == key:
     10             return sub

KeyError: 

Some timings:

In [91]: l = sorted((hash(s), s,randint(1,100000)) for s in ("".join(sample(ascii_letters,randint(10,26))) for _ in xrange(1000000)))

In [92]: l[-1]
Out[92]: (9223342880888029755, 'FocWPinpYZXjHhBqRkJxQeGMa', 43768)

In [93]: timeit get_bis(l[-1][1],l)
100000 loops, best of 3: 5.29 µs per loop

In [94]: l[250000]
Out[94]: (-4616437486317828880, 'qXsybdhFPLczWwCQkm', 86136)

In [95]: timeit get_bis(l[250000][1],l)
100000 loops, best of 3: 4.4 µs per loop

In [96]: l[750000]
Out[96]: (4623630109115829672, 'dlQewhpMoBGmn', 39904)

In [97]: timeit get_bis(l[750000][1],l)
100000 loops, best of 3: 4.46 µs per loop

To get a better idea you would have to throw in collisions, but finding the section that the hash may belong to is pretty efficient.

Just type-casting a few variables and compiling with Cython:

def cython_bisect_left(a, long x, long lo=0):
    if lo < 0:
        raise ValueError('lo must be non-negative')
    cdef long hi = len(a)
    while lo < hi:
        mid = (lo + hi) // 2
        if a[mid][0] < x:
            lo = mid + 1
        else:
            hi = mid
    return lo

def cython_get(str key, d):
    cdef long h = hash(key)
    cdef long ind = cython_bisect_left(d, h)
    if ind == len(d):
        raise KeyError(key)
    for i in xrange(ind, len(d)):
        if d[i][0] != h:
            raise KeyError()
        if d[i][1] == key:
            return d[i][2]
    raise KeyError()

Gets us almost down to 1 microsecond:

In [13]: timeit cython_get(l[-1][1],l)
The slowest run took 40.77 times longer than the fastest. This could mean that an intermediate result is being cached 
1000000 loops, best of 3: 1.44 µs per loop

In [14]: timeit cython_get(l[250000][1],l)
1000000 loops, best of 3: 1.33 µs per loop

In [15]: timeit cython_get(l[750000][1],l)
1000000 loops, best of 3: 1.33 µs per loop

Upvotes: 1

ShadowRanger

Reputation: 155418

First, prehash the key, don't do it over and over. Second, you can use next with an unpacking generator expression to optimize a bit:

def get(key, D, hasher=hash):
    keyhash = hasher(key)
    try:
        return next(v for hsh, k, v in D if keyhash == hsh and key == k)
    except StopIteration:
        raise TypeError('Key not found in the dictionary.')

That said, you claim you want to do a binary search, but the above is still a linear search, just optimized to avoid redundant work and to stop when the desired key is found (it checks the hash first, assuming key comparison is expensive, then checks key equality only on a hash match, since you complained about issues with duplicates). If the goal is binary search, and D is sorted by hash code, you'd want to use the bisect module. It's not trivial to do so (because bisect doesn't take a key argument the way sorted does), but if you could split D into two parts, one with just the hash codes, and one with hash codes, keys and values, you could do:

import bisect

def get(key, Dhashes, D, hasher=hash):
    keyhash = hasher(key)
    # Search whole list of hashes for beginning of range with correct hash
    start = bisect.bisect_left(Dhashes, keyhash)
    # Search for end point of correct hashes (limit to entries after start for speed)
    end = bisect.bisect_right(Dhashes, keyhash, start)
    try:
        # Linear search of only start->end indices for exact key
        return next(v for hsh, k, v in D[start:end] if key == k)
    except StopIteration:
        raise TypeError('Key not found in the dictionary.')

That gets you true binary search, but as noted, requires that the hash codes be separated from the complete tuples of hashcode, key, value ahead of time, before the search. Splitting the hash codes at the time of each search wouldn't be worth it since the loop that split them off could have just found your desired value directly (it would only be worth splitting if you were performing many searches at once).
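To illustrate that one-time split (a sketch using the question's sample data, not code from the answer):

```python
# One-time preparation, done before any searches:
# extract just the hash codes from the already-sorted list of tuples.
D = [(-5092793511388848640, 'test1', 1),
     (-5092793511388848639, 'test0', 0),
     (-5092793511388848638, 'test3', 3)]
Dhashes = [hsh for hsh, k, v in D]
# Dhashes stays index-aligned with D, so positions returned by
# bisect on Dhashes are valid indices into D as well.
```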

As Padraic notes in his answer, at the expense of giving up the C accelerator code, you could copy and modify the pure Python implementation of bisect.bisect_right and bisect.bisect_left changing each use of a[mid] to a[mid][0] which would get you bisect code that doesn't require you to maintain a separate list of hashes. The memory savings might be worth the higher lookup costs. Don't use itertools.islice to perform the slicing though, as islice with a start index iterates the whole list up to that point; true slicing only reads and copies what you care about. If you want to avoid the second bisect operation though, you could always write your own Sequence-optimized islice and combine it with itertools.takewhile to get a similar effect without having to calculate the end index up front. Code for that might be something like:

from itertools import takewhile

# Copied from bisect.bisect_left, with unused arguments removed and only
# index 0 of each tuple checked
def bisect_idx0_left(a, x):
    lo, hi = 0, len(a)
    while lo < hi:
        mid = (lo+hi)//2
        if a[mid][0] < x: lo = mid+1
        else: hi = mid
    return lo

def sequence_skipper(seq, start):
    return (seq[i] for i in xrange(start, len(seq)))

def get(key, D, hasher=hash):
    keyhash = hasher(key)
    # Search whole list of hashes for beginning of range with correct hash
    start = bisect_idx0_left(D, keyhash)
    # Make lazy iterator that skips start values in the list
    # and stops producing values when the hash stops matching
    hashmatches = takewhile(lambda x: keyhash == x[0], sequence_skipper(D, start))
    try:
        # Linear search of only indices with matching hashes for exact key
        return next(v for hsh, k, v in hashmatches if key == k)
    except StopIteration:
        raise TypeError('Key not found in the dictionary.')

Note: You could save even more work at the expense of more memory, by having Dhashes actually be (hashcode, key) pairs; assuming uniqueness, this would mean a single bisect.bisect* call, not two, and no need for a scan between indices for a key match; you either found it in the binary search or you didn't. Just for example, I generated 1000 key value pairs, storing them as either (hashcode, key, value) tuples in a list (which I sorted on the hashcode), or a dict mapping keys->values. The keys were all 65 bit ints (long enough that the hash code wasn't a trivial self-mapping). Using the linear search code I provided up top, it took ~15 microseconds to find the value located at index 321; with binary search (having copied hashes only to a separate list) it took just over 2 microseconds. Looking it up in the equivalent dict took ~55 nanoseconds; the run time overhead even for binary search worked out to ~37x, and linear search ran ~270x higher. And that's before we get into the increased memory costs, increased code complexity, and increased overhead to maintain sorted order (assuming D is ever modified).
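A sketch of that (hashcode, key) variant, assuming D is kept sorted by (hash, key) so that entries with equal hashes are ordered by key as well; the function name is illustrative, not from the answer:

```python
import bisect

def get_pair(key, Dpairs, D, hasher=hash):
    # Dpairs holds (hashcode, key) tuples, sorted identically to D.
    # A single bisect on the exact (hash, key) pair replaces both the
    # range search and the linear scan for a key match.
    target = (hasher(key), key)
    i = bisect.bisect_left(Dpairs, target)
    if i != len(Dpairs) and Dpairs[i] == target:
        return D[i][2]
    raise TypeError('Key not found in the dictionary.')
```

Either the binary search lands exactly on the (hash, key) pair, or the key is absent; no follow-up scan is needed.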

Lastly: You say "I'd like to avoid using [dicts]", but give no explanation as to why. dicts are the correct way to solve a problem like this; assuming no self-hashing (i.e. key is an int that hashes to itself, possibly saving the cost of the hash code), the memory overhead just for the list of tuples (not including a separate list of hash codes) would be (roughly) twice that of a simple dict mapping keys to values. dict would also prevent accidentally storing duplicates, have ~O(1) insertion cost (even with bisect, insertion maintaining sorted order would have ~O(log n) lookup and O(n) memory move costs), ~O(1) lookup cost (vs. ~O(log n) with bisect), and beyond the big-O differences, would do all the work using C built-in functions that are heavily optimized, so the real savings would be greater.
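For comparison, converting the question's list of tuples into a dict is a one-liner (a sketch using the sample data from the question):

```python
# The same data as a plain dict; hash codes are managed internally,
# so there is no sorted order to maintain and no bisect bookkeeping.
items = [(-5092793511388848640, 'test1', 1),
         (-5092793511388848639, 'test0', 0),
         (-5092793511388848638, 'test3', 3),
         (-5092793511388848637, 'test2', 2),
         (-5092793511388848636, 'test5', 5)]
d = {k: v for _, k, v in items}
d['test3']  # -> 3, in ~O(1)
```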

Upvotes: 3

manishrw
manishrw

Reputation: 429

Try using a list comprehension. I'm not sure if it's the most efficient way, but it's the Pythonic way and quite effective!

  keyhash = hash(key)
  [ x for x in D if x[0] == keyhash ]

Upvotes: 0
