Reputation: 1363
Imagine we have an array of unique integers. Given an integer (N
) of that list, I want to be able to get its index (I
) in the array as fast as possible.
My idea was to generate an object that given N
returns I
. I though of using either a structured array with data type (N,I)
and sorted by N
or simply a dictionary with keys N
.
The search speeds of both methods seems independent of the size of the object which leads me to believe that they are controlled by the overheads. However, I was a bit surprised to find that searching the dictionary was almost 10x faster than searching the structured array. So my questions are:
from __future__ import division
import numpy as np
import timeit
#Time a function
def Timeme(funct,var,NN=10,NNN=10):
for i in xrange(NN):
start =timeit.default_timer()
for t in xrange(NNN):
funct(*var)
end =timeit.default_timer()
print str(i)+': '+str((end - start)/NNN*1000)
#Function to build a dictionary
def mydict(Flist):
Mydict=dict()
for n,i in Flist:
Mydict[n]=i
return Mydict
#Functions to access the data
def myfd(Mydict,vtest):
return Mydict[vtest]
def myfs(Flist,vtest):
n=Flist['N'].searchsorted(vtest)
return Flist['I'][n] #Flist[n]['I'] is slower
#N=100000
N=100
# "Allocate empty structured array"
Flist=np.empty(N,dtype=[('N','i4'),('I','i4')])
# "Fill N with randoms and I with sequence"
Flist['N'] = np.random.randint(N*1000,size=N)
Flist['I'] = np.arange(N)
# "Create test value"
ntest=np.random.randint(N)
vtest=Flist['N'][ntest]
# "Sort array on N"
Flist.sort(order='N')
# "Make dictionary"
Mydict=dict(Flist)
# "Get values"
nrd=myfd(Mydict,vtest)
nrs=myfs(Flist,vtest)
print "Tests OK: " + str(ntest == nrd and ntest == nrs)
print "\nSearch with Dictionary:"
Timeme(myfd,[Mydict,vtest],NN=5,NNN=100)
print "\nSearch directly in Array:"
Timeme(myfs,[Flist,vtest],NN=5,NNN=100)
Tests OK: True
Search with Dictionary:
0: 0.000404204885682
1: 0.000409016848607
2: 0.000418640774457
3: 0.000404204885682
4: 0.000394580959833
Search directly in Array:
0: 0.00455211692685
1: 0.00465798011119
2: 0.00458580066732
3: 0.00464354422242
4: 0.00476384329554
Upvotes: 1
Views: 1024
Reputation: 95908
This could partially be explained by the method-call / function-call overhead. Your dictionary-search function merely does a single operation, indexing, which is translated into a call to my_dict.__getitem__(key)
, whereas your array-based implementation ultimately calls 3 methods, .searchsorted
and __getitem__
twice. Python is a dynamic language, function-calls and, especially, method calls (because of method resolution) are expensive.
But fundamentally, your dict
-based implementation should scale better. Python dict
objects are highly optimized hash-maps with constant-time search, usually. Your array based implementation is binary search, so it is O(log(n)). You'll see this with a test-case where you select the worst-case, i.e., searching for an element not in the array. Given that searchsorted
scales logarithmically, you might have to increase the size of your array dramatically (e.g.100x, 1000x) before you see notable runtime effects.
There is absolutely no chance you'll implement a faster look-up than the built-in dict
in Python.
Upvotes: 1