Miguel

Reputation: 1363

Searching a dictionary vs. searching a sorted NumPy structured array

Imagine we have an array of unique integers. Given an integer (N) from that array, I want to get its index (I) in the array as fast as possible.

My idea was to build an object that, given N, returns I. I thought of using either a structured array with data type (N, I) sorted by N, or simply a dictionary keyed on N.

The search speed of both methods seems independent of the size of the object, which leads me to believe it is dominated by per-call overhead. However, I was a bit surprised to find that searching the dictionary is almost 10x faster than searching the structured array. So my questions are:

  1. Why is the dictionary so much faster than my array implementation?
  2. Is there an alternative method that is even faster than these two?

MWE:

from __future__ import division
import numpy as np
import timeit

# Time a function: NN outer repetitions, each averaging NNN calls (ms per call)
def Timeme(funct, var, NN=10, NNN=10):
    for i in xrange(NN):
        start = timeit.default_timer()
        for t in xrange(NNN):
            funct(*var)
        end = timeit.default_timer()
        print str(i) + ': ' + str((end - start) / NNN * 1000)

# Function to build a dictionary from (N, I) pairs
# (never called below; dict(Flist) builds the same mapping)
def mydict(Flist):
    Mydict = dict()
    for n, i in Flist:
        Mydict[n] = i
    return Mydict

# Functions to access the data
def myfd(Mydict, vtest):
    return Mydict[vtest]

def myfs(Flist, vtest):
    n = Flist['N'].searchsorted(vtest)
    return Flist['I'][n]  # Flist[n]['I'] is slower

#N=100000  
N=100

# "Allocate empty structured array"
Flist=np.empty(N,dtype=[('N','i4'),('I','i4')])

# "Fill N with randoms and I with sequence"
Flist['N'] = np.random.randint(N*1000,size=N)
Flist['I'] = np.arange(N)

# "Create test value"
ntest=np.random.randint(N)
vtest=Flist['N'][ntest]

# "Sort array on N"
Flist.sort(order='N')

# "Make dictionary"
Mydict=dict(Flist)

# "Get values"    
nrd=myfd(Mydict,vtest)
nrs=myfs(Flist,vtest)

print "Tests OK: " + str(ntest == nrd and ntest == nrs) 

print "\nSearch with Dictionary:"
Timeme(myfd,[Mydict,vtest],NN=5,NNN=100)
print "\nSearch directly in Array:"
Timeme(myfs,[Flist,vtest],NN=5,NNN=100)

Result:

Tests OK: True

Search with Dictionary:
0: 0.000404204885682
1: 0.000409016848607
2: 0.000418640774457
3: 0.000404204885682
4: 0.000394580959833

Search directly in Array:
0: 0.00455211692685
1: 0.00465798011119
2: 0.00458580066732
3: 0.00464354422242
4: 0.00476384329554

Upvotes: 1

Views: 1024

Answers (1)

juanpa.arrivillaga

Reputation: 95908

This could partially be explained by method-call / function-call overhead. Your dictionary-search function does a single operation, indexing, which translates into one call to Mydict.__getitem__(key), whereas your array-based implementation ultimately makes four calls: a __getitem__ for each field lookup ('N' and 'I'), .searchsorted, and a final __getitem__ to index the result. Python is a dynamic language; function calls and, especially, method calls (because of method resolution) are expensive.
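
To gauge how much of the gap is pure call overhead, here is a sketch (mine, not from the answer) that hoists the field and method lookups out of the timed function; myfs_bound is a hypothetical variant of the question's myfs:

# Hypothetical variant of myfs: pre-bind the field views and the bound
# method once, so each lookup is one call plus one indexing operation.
Ncol = Flist['N']           # one-time field lookup
Icol = Flist['I']
ss = Ncol.searchsorted      # one-time bound-method lookup

def myfs_bound(vtest):
    return Icol[ss(vtest)]  # one call + one __getitem__ per lookup

Timeme(myfs_bound, [vtest], NN=5, NNN=100)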

But fundamentally, your dict-based implementation should scale better. Python dict objects are highly optimized hash maps with (on average) constant-time lookup. Your array-based implementation is a binary search, so it is O(log n). You'll see this with a test case that hits the worst case, i.e. searching for an element not in the array. Given that searchsorted scales logarithmically, you might have to increase the size of your array dramatically (e.g. 100x, 1000x) before you see notable runtime effects.
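
A minimal sketch of that experiment (my addition, with hypothetical sizes) could look like this:

# Time a guaranteed miss in both structures as the data grows.
for size in (10**4, 10**5, 10**6):  # hypothetical sizes; adjust to taste
    big = np.sort(np.random.randint(size * 1000, size=size))
    bigdict = dict.fromkeys(big.tolist())
    missing = -1  # randint here never returns negatives, so always a miss
    t0 = timeit.default_timer()
    for _ in xrange(1000):
        big.searchsorted(missing)  # O(log n) binary search
    t1 = timeit.default_timer()
    for _ in xrange(1000):
        missing in bigdict  # O(1) average-case hash lookup
    t2 = timeit.default_timer()
    print str(size) + ': searchsorted ' + str(t1 - t0) + 's, dict ' + str(t2 - t1) + 's'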

There is absolutely no chance you'll implement a faster look-up than the built-in dict in Python.

Upvotes: 1
