Reputation: 23530

Python dictionary lookup speed with NumPy datatypes

BACKGROUND

I have a lot of numeric message codes in a NumPy array, and I'd need to convert them into strings fast. I have had some problems with the performance and would like to understand why and how to make it quick.

SOME BENCHMARKS

I - The trivial approach

import numpy as np

# dictionary to use as the lookup dictionary
lookupdict = {
     1: "val1",
     2: "val2",
    27: "val3",
    35: "val4",
    59: "val5" }

# some test data
arr = np.random.choice(lookupdict.keys(), 1000000)

# create a list of words looked up
res = [ lookupdict[k] for k in arr ]

The dictionary lookup takes the better part of my coffee break, 758 ms. (I also tried res = map(lookupdict.get, arr) but that's even worse.)

II - Without NumPy

import random

# dictionary to use as the lookup dictionary
lookupdict = {
     1: "val1",
     2: "val2",
    27: "val3",
    35: "val4",
    59: "val5" }

# some test data
arr = [ random.choice(lookupdict.keys()) for _ in range(1000000) ]

# create a list of words looked up
res = [ lookupdict[k] for k in arr ]

The timing results change quite considerably to 76 ms!

It should be noted that I am interested in timing the lookup. The random generation is just to create some test data. It is not interesting whether it takes a lot of time or not. All benchmark results given here are only for the one million lookups.

III - Convert NumPy array to a list

My first guess was that this has something to do with list vs. array problems. However, by modifying the NumPy version to use lists:

res = [ lookupdict[k] for k in list(arr) ]

gives me 778 ms, of which around 110 ms is spent converting the list and 570 ms doing the lookup. So, the lookup is a bit faster, but the total time is the same.

IV - Type conversion from np.int32 to int

As the only other difference seems to be the data type (np.int32 vs. int), I tried converting the types on-the-fly. This is a bit stupid, as probably dict does the same:

res = [ lookupdict[int(k)] for k in arr ]

However, this seems to do something interesting, because the time drops to 266 ms. It seems that almost-but-not-quite-the-same datatypes play nasty tricks with dictionary lookups and that dict code is not very efficient with conversions.

V - Dictionary key conversion to np.int32

To test this, I modified the NumPy version to use exactly the same data type in dict keys and lookup:

import numpy as np

# dictionary to use as the lookup dictionary
lookupdict = {
     np.int32(1): "val1",
     np.int32(2): "val2",
    np.int32(27): "val3",
    np.int32(35): "val4",
    np.int32(59): "val5" }

# some test data
arr = np.random.choice(lookupdict.keys(), 1000000)

# create a list of words looked up
res = [ lookupdict[k] for k in arr ]

This improved to 177 ms. Not an insignificant improvement but a far cry form the 76 ms.

VI - Array conversion to use int objects

import numpy as np

# dictionary to use as the lookup dictionary
lookupdict = {
     1: "val1",
     2: "val2",
    27: "val3",
    35: "val4",
    59: "val5" }

# some test data
arr = np.array([ random.choice(lookupdict.keys()) for _ in range(1000000) ], 
               dtype='object')

# create a list of words looked up
res = [ lookupdict[k] for k in arr ]

This gives 86 ms, which is already very close to the native Python 76 ms.

Result summary

dict keys int, indexing with int (native Python): 76 ms
dict keys int, indexing with int objects (NumPy): 86 ms
dict keys np.int32, indexing with np.int32: 177 ms
dict keys int, indexing with np.int32: 758 ms

QUESTION(S)

Why? And what can I do to make the dictionary lookups as fast as possible? My input data is a NumPy array, so the best (fastest but ugly) this far is to convert the dict keys into np.int32. (Unfortunately, the dict keys may be spread over a wide range of numbers, so indexing array-by-array is not a viable alternative. Fast it would be though, 10 ms.)

Upvotes: 15

Answers (6)

philn

Reputation: 784

This seems to be easily the fastest way to do it when your keys are integers. Just do array indexing.

import numpy as np

# dictionary to use as the lookup dictionary
lookupdict = {
     1: "val1",
     2: "val2",
    27: "val3",
    35: "val4",
    59: "val5" }

# some test data
arr = np.random.choice(lookupdict.keys(), 1000000)
lookup_array=np.array([None if i not in lookupdict else lookupdict[i] for i in range(60)])
%timeit lookup_array[arr]

5.3 ms ± 73.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Upvotes: 1

JoeCondron

Reputation: 8906

Here's a solution using Pandas which gives a five-fold improvement:

import numpy as np
import pandas as pd

# dictionary to use as the lookup dictionary
lookupdict = {
 1: "val1",
 2: "val2",
27: "val3",
35: "val4",
59: "val5" }

# some test data
arr = np.random.choice(lookupdict.keys(), 1000000)

# create a list of words looked up
%timeit res = [ lookupdict[k] for k in arr ]
%timeit res_pd = pd.Series(lookupdict).reindex(arr).values
print all(res == res_pd)

10 loops, best of 3: 192 ms per loop
10 loops, best of 3: 35.3 ms per loop
True

This is an average of 35ns per element and so must be impossible to beat in native Python. If you're not familiar with Pandas, the Series object is like an OrderedDict or an indexed array, which can be constructed from a standard Python dict. The reindex method provides very fast look-up; I'm not sure how as I don't really know what goes on under the hood (I'm not a very experienced programmer) but it's probably written in C or Cython. Maybe you could check out the source and come up with an even faster bespoke solution to your problem. Finally, the values attribute simply returns the array underlying the Series.

EDIT: Here's a purely numpy solution which is almost as good as the Pandas:

keys = np.array(lookupdict.keys())
strings = np.array(lookupdict.values())
%timeit res_np = strings[(np.atleast_2d(arr).T == keys).argmax(axis=1)]
10 loops, best of 3: 44.6 ms per loop

print all(res == res_np)
True

Upvotes: 1

Bi Rico

Reputation: 25833

One option you have not considered in your question, though an admittedly limited option that will not be workable in some situations, is to convert your lookupdict to an array. On my machine, with a small dict like your example it is very fast.

import numpy as np

# dictionary to use as the lookup dictionary
lookupdict = {
     1: "val1",
     2: "val2",
    27: "val3",
    35: "val4",
    59: "val5" }

# some test data
arr = np.random.choice(lookupdict.keys(), 1000000)

table = np.empty(max(lookupdict.keys()) + 1, dtype='S4')
for key, value in lookupdict.items():
    table[key] = value

res = table[arr]

Upvotes: 0

hpaulj

Reputation: 231540

In my timings, your II - Without NumPy is quite a bit slower than I

In [11]: timeit [lookupdict[k] for k in np.random.choice(lookupdict.keys(),1000000)]
1 loops, best of 3: 658 ms per loop

In [12]: timeit [lookupdict[k] for k in [np.random.choice(lookupdict.keys()) for _ in range(1000000)]]
1 loops, best of 3: 8.04 s per loop

But if skip the lookup by making the choice on the values, you gain more time

In [34]: timeit np.random.choice(lookupdict.values(),1000000)
10 loops, best of 3: 85.3 ms per loop

OK, lets focus on the lookup:

In [26]: arr =np.random.choice(lookupdict.keys(),1000000)

In [27]: arrlist=arr.tolist()

In [28]: timeit res = [lookupdict[k] for k in arr]
1 loops, best of 3: 583 ms per loop

In [29]: timeit res = [lookupdict[k] for k in arrlist]
10 loops, best of 3: 120 ms per loop

In [30]: timeit res = [lookupdict[k] for k in list(arr)]
1 loops, best of 3: 675 ms per loop

In [31]: timeit res = [lookupdict[k] for k in arr.tolist()]
10 loops, best of 3: 156 ms per loop

In [32]: timeit res = [k for k in arr]
1 loops, best of 3: 215 ms per loop

In [33]: timeit res = [k for k in arrlist]
10 loops, best of 3: 51.4 ms per loop

In [42]: timeit arr.tolist()
10 loops, best of 3: 33.6 ms per loop

In [43]: timeit list(arr)
1 loops, best of 3: 264 ms per loop

First observation - iteration over an np.array is slower than iteration over the equivalent list

Second - list(arr) is slower the arr.tolist(). list() appears to have 2 problems. By itself it is slower, and the items are np.int32.

Upvotes: 4

DrV

Reputation: 23530

This is interesting, I may have found an answer to my question.

The alternative III was to convert the array into a list. It seems that this provides very good results if done the right way. This:

res = [ lookupdict[k] for k in list(arr) ]

clocks 778 ms.

But this:

res = [ lookupdict[k] for k in arr.tolist() ]

clocks 86 ms.

The technical explanation behind this being that arr.tolist converts the array into int objects, whereas list(arr) creates a list of np.int32 objects.

Upvotes: 5

shx2

Reputation: 64328

As you suspected, it's int32.__hash__ whose at fault here, being x11 as slow as int.__hash__:

%timeit hash(5)
10000000 loops, best of 3: 39.2 ns per loop
%timeit hash(np.int32(5))
1000000 loops, best of 3: 444 ns per loop

(the int32 type is implemented in C. If you're really curios, you can dig in the source code and find out what it's doing there which takes so long).

EDIT:

A second part which slows things down is the implicit == comparison on dict lookup:

a = np.int32(5)
b = np.int32(5)
%timeit a == b  # comparing two int32's
10000000 loops, best of 3: 61.9 ns per loop
%timeit a == 5  # comparing int32 against int -- much slower
100000 loops, best of 3: 2.62 us per loop

This explains why your V is so much faster than I and IV. Of course, sticking with an all-int solution would be faster.

So as I see it, you have two options:

stick with the pure int type, or convert to int before the dict-lookup
if the biggest code value is not too big, and/or memory is not a problem, you can trade dict-lookups for list-indexing, which do not require hashing.

E.g.:

lookuplist = [None] * (max(lookupdict.keys()) + 1)
for k,v in lookupdict.items():
    lookuplist[k] = v

res = [ lookuplist[k] for k in arr ] # using list indexing

(EDIT: you might also want to experiment with np.choose here)

Upvotes: 5

Python dictionary lookup speed with NumPy datatypes

Answers (6)

Related Questions