Reputation: 23530
BACKGROUND
I have a lot of numeric message codes in a NumPy array, and I'd need to convert them into strings fast. I have had some problems with the performance and would like to understand why and how to make it quick.
SOME BENCHMARKS
I - The trivial approach
import numpy as np
# dictionary to use as the lookup dictionary
lookupdict = {
1: "val1",
2: "val2",
27: "val3",
35: "val4",
59: "val5" }
# some test data
arr = np.random.choice(lookupdict.keys(), 1000000)
# create a list of words looked up
res = [ lookupdict[k] for k in arr ]
The dictionary lookup takes the better part of my coffee break, 758 ms. (I also tried res = map(lookupdict.get, arr)
but that's even worse.)
II - Without NumPy
import random
# dictionary to use as the lookup dictionary
lookupdict = {
1: "val1",
2: "val2",
27: "val3",
35: "val4",
59: "val5" }
# some test data
arr = [ random.choice(lookupdict.keys()) for _ in range(1000000) ]
# create a list of words looked up
res = [ lookupdict[k] for k in arr ]
The timing results change quite considerably to 76 ms!
It should be noted that I am interested in timing the lookup. The random generation is just to create some test data. It is not interesting whether it takes a lot of time or not. All benchmark results given here are only for the one million lookups.
III - Convert NumPy array to a list
My first guess was that this has something to do with list vs. array problems. However, by modifying the NumPy version to use lists:
res = [ lookupdict[k] for k in list(arr) ]
gives me 778 ms, of which around 110 ms is spent converting the list and 570 ms doing the lookup. So, the lookup is a bit faster, but the total time is the same.
IV - Type conversion from np.int32
to int
As the only other difference seems to be the data type (np.int32
vs. int
), I tried converting the types on-the-fly. This is a bit stupid, as probably dict does the same:
res = [ lookupdict[int(k)] for k in arr ]
However, this seems to do something interesting, because the time drops to 266 ms. It seems that almost-but-not-quite-the-same datatypes play nasty tricks with dictionary lookups and that dict code is not very efficient with conversions.
V - Dictionary key conversion to np.int32
To test this, I modified the NumPy version to use exactly the same data type in dict keys and lookup:
import numpy as np
# dictionary to use as the lookup dictionary
lookupdict = {
np.int32(1): "val1",
np.int32(2): "val2",
np.int32(27): "val3",
np.int32(35): "val4",
np.int32(59): "val5" }
# some test data
arr = np.random.choice(lookupdict.keys(), 1000000)
# create a list of words looked up
res = [ lookupdict[k] for k in arr ]
This improved to 177 ms. Not an insignificant improvement but a far cry form the 76 ms.
VI - Array conversion to use int
objects
import numpy as np
# dictionary to use as the lookup dictionary
lookupdict = {
1: "val1",
2: "val2",
27: "val3",
35: "val4",
59: "val5" }
# some test data
arr = np.array([ random.choice(lookupdict.keys()) for _ in range(1000000) ],
dtype='object')
# create a list of words looked up
res = [ lookupdict[k] for k in arr ]
This gives 86 ms, which is already very close to the native Python 76 ms.
Result summary
int
, indexing with int
(native Python): 76 msint
, indexing with int
objects (NumPy): 86 msnp.int32
, indexing with np.int32
: 177 msint
, indexing with np.int32
: 758 msQUESTION(S)
Why? And what can I do to make the dictionary lookups as fast as possible? My input data is a NumPy array, so the best (fastest but ugly) this far is to convert the dict keys into np.int32
. (Unfortunately, the dict keys may be spread over a wide range of numbers, so indexing array-by-array is not a viable alternative. Fast it would be though, 10 ms.)
Upvotes: 15
Views: 11879
Reputation: 784
This seems to be easily the fastest way to do it when your keys are integers. Just do array indexing.
import numpy as np
# dictionary to use as the lookup dictionary
lookupdict = {
1: "val1",
2: "val2",
27: "val3",
35: "val4",
59: "val5" }
# some test data
arr = np.random.choice(lookupdict.keys(), 1000000)
lookup_array=np.array([None if i not in lookupdict else lookupdict[i] for i in range(60)])
%timeit lookup_array[arr]
5.3 ms ± 73.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Upvotes: 1
Reputation: 8906
Here's a solution using Pandas which gives a five-fold improvement:
import numpy as np
import pandas as pd
# dictionary to use as the lookup dictionary
lookupdict = {
1: "val1",
2: "val2",
27: "val3",
35: "val4",
59: "val5" }
# some test data
arr = np.random.choice(lookupdict.keys(), 1000000)
# create a list of words looked up
%timeit res = [ lookupdict[k] for k in arr ]
%timeit res_pd = pd.Series(lookupdict).reindex(arr).values
print all(res == res_pd)
10 loops, best of 3: 192 ms per loop
10 loops, best of 3: 35.3 ms per loop
True
This is an average of 35ns per element and so must be impossible to beat in native Python. If you're not familiar with Pandas, the Series object is like an OrderedDict or an indexed array, which can be constructed from a standard Python dict. The reindex
method provides very fast look-up; I'm not sure how as I don't really know what goes on under the hood (I'm not a very experienced programmer) but it's probably written in C or Cython. Maybe you could check out the source and come up with an even faster bespoke solution to your problem. Finally, the values attribute simply returns the array underlying the Series.
EDIT: Here's a purely numpy solution which is almost as good as the Pandas:
keys = np.array(lookupdict.keys())
strings = np.array(lookupdict.values())
%timeit res_np = strings[(np.atleast_2d(arr).T == keys).argmax(axis=1)]
10 loops, best of 3: 44.6 ms per loop
print all(res == res_np)
True
Upvotes: 1
Reputation: 25833
One option you have not considered in your question, though an admittedly limited option that will not be workable in some situations, is to convert your lookupdict to an array. On my machine, with a small dict like your example it is very fast.
import numpy as np
# dictionary to use as the lookup dictionary
lookupdict = {
1: "val1",
2: "val2",
27: "val3",
35: "val4",
59: "val5" }
# some test data
arr = np.random.choice(lookupdict.keys(), 1000000)
table = np.empty(max(lookupdict.keys()) + 1, dtype='S4')
for key, value in lookupdict.items():
table[key] = value
res = table[arr]
Upvotes: 0
Reputation: 231540
In my timings, your II - Without NumPy
is quite a bit slower than I
In [11]: timeit [lookupdict[k] for k in np.random.choice(lookupdict.keys(),1000000)]
1 loops, best of 3: 658 ms per loop
In [12]: timeit [lookupdict[k] for k in [np.random.choice(lookupdict.keys()) for _ in range(1000000)]]
1 loops, best of 3: 8.04 s per loop
But if skip the lookup by making the choice
on the values, you gain more time
In [34]: timeit np.random.choice(lookupdict.values(),1000000)
10 loops, best of 3: 85.3 ms per loop
OK, lets focus on the lookup:
In [26]: arr =np.random.choice(lookupdict.keys(),1000000)
In [27]: arrlist=arr.tolist()
In [28]: timeit res = [lookupdict[k] for k in arr]
1 loops, best of 3: 583 ms per loop
In [29]: timeit res = [lookupdict[k] for k in arrlist]
10 loops, best of 3: 120 ms per loop
In [30]: timeit res = [lookupdict[k] for k in list(arr)]
1 loops, best of 3: 675 ms per loop
In [31]: timeit res = [lookupdict[k] for k in arr.tolist()]
10 loops, best of 3: 156 ms per loop
In [32]: timeit res = [k for k in arr]
1 loops, best of 3: 215 ms per loop
In [33]: timeit res = [k for k in arrlist]
10 loops, best of 3: 51.4 ms per loop
In [42]: timeit arr.tolist()
10 loops, best of 3: 33.6 ms per loop
In [43]: timeit list(arr)
1 loops, best of 3: 264 ms per loop
First observation - iteration over an np.array
is slower than iteration over the equivalent list
Second - list(arr)
is slower the arr.tolist()
. list()
appears to have 2 problems. By itself it is slower, and the items are np.int32
.
Upvotes: 4
Reputation: 23530
This is interesting, I may have found an answer to my question.
The alternative III was to convert the array into a list. It seems that this provides very good results if done the right way. This:
res = [ lookupdict[k] for k in list(arr) ]
clocks 778 ms.
But this:
res = [ lookupdict[k] for k in arr.tolist() ]
clocks 86 ms.
The technical explanation behind this being that arr.tolist
converts the array into int
objects, whereas list(arr)
creates a list of np.int32
objects.
Upvotes: 5
Reputation: 64328
As you suspected, it's int32.__hash__
whose at fault here, being x11 as slow as int.__hash__
:
%timeit hash(5)
10000000 loops, best of 3: 39.2 ns per loop
%timeit hash(np.int32(5))
1000000 loops, best of 3: 444 ns per loop
(the int32
type is implemented in C. If you're really curios, you can dig in the source code and find out what it's doing there which takes so long).
EDIT:
A second part which slows things down is the implicit ==
comparison on dict lookup:
a = np.int32(5)
b = np.int32(5)
%timeit a == b # comparing two int32's
10000000 loops, best of 3: 61.9 ns per loop
%timeit a == 5 # comparing int32 against int -- much slower
100000 loops, best of 3: 2.62 us per loop
This explains why your V is so much faster than I and IV. Of course, sticking with an all-int
solution would be faster.
So as I see it, you have two options:
int
type, or convert to int before the dict-lookuphash
ing.E.g.:
lookuplist = [None] * (max(lookupdict.keys()) + 1)
for k,v in lookupdict.items():
lookuplist[k] = v
res = [ lookuplist[k] for k in arr ] # using list indexing
(EDIT: you might also want to experiment with np.choose
here)
Upvotes: 5