D R
D R

Reputation: 22510

Fast replacement of values in a numpy array

I have a very large numpy array (containing up to a million elements) like the one below:

[0,1,6,5,1,2,7,6,2,3,8,7,3,4,9,8,5,6,11,10,6,7,12,11,7,
8,13,12,8,9,14,13,10,11,16,15,11,12,17,16,12,13,18,17,13,
14,19,18,15,16,21,20,16,17,22,21,17,18,23,22,18,19,24,23]

and a small dictionary map for replacing some of the elements in the above array

{4: 0, 9: 5, 14: 10, 19: 15, 20: 0, 21: 1, 22: 2, 23: 3, 24: 0}

I would like to replace some of the elements according to the map above. The numpy array is really large, and only a small subset of the elements (occurring as keys in the dictionary) will be replaced with the corresponding values. What is the fastest way to do this?

Upvotes: 68

Views: 70746

Answers (11)

Mateen Ulhaq
Mateen Ulhaq

Reputation: 27271

Potential speedup using np.unique

Use np.unique to collapse an array into a smaller array values, then remap that smaller array, and then expand back into the original array.

def np_remap(arr, d):
    values, inverse = np.unique(arr, return_inverse=True)
    values = np.array([d[x] for x in values])  # Any remapping method works.
    return values[inverse].reshape(arr.shape)

Recommended for use when:

  • max(d.keys()) is large.
  • len(d) is much smaller than max(d.keys()).
  • Under certain conditions, this has lower time complexity than the np.searchsorted method.

Test:

>>> d = {111: 1010, 222: 2020, 333: 3030}
>>> arr = np.array([[333, 111], [111, 222], [111, 333]])
>>> np_remap(arr, d)
array([[3030, 1010],
       [1010, 2020],
       [1010, 3030]])

Upvotes: 0

Nils Werner
Nils Werner

Reputation: 36859

A fully vectorized solution using np.in1d and np.searchsorted:

replace = numpy.array([list(replace.keys()), list(replace.values())])    # Create 2D replacement matrix
mask = numpy.in1d(data, replace[0, :])                                   # Find elements that need replacement
data[mask] = replace[1, numpy.searchsorted(replace[0, :], data[mask])]   # Replace elements

Upvotes: 4

Jean Lescut
Jean Lescut

Reputation: 1467

I benchmarked some solutions, and the result is without appeal :

import timeit
import numpy as np

array = 2 * np.round(np.random.uniform(0,10000,300000)).astype(int)
from_values = np.unique(array) # pair values from 0 to 2000
to_values = np.arange(from_values.size) # all values from 0 to 1000
d = dict(zip(from_values, to_values))

def method_for_loop():
    out = array.copy()
    for from_value, to_value in zip(from_values, to_values) :
        out[out == from_value] = to_value
    print('Check method_for_loop :', np.all(out == array/2)) # Just checking
print('Time method_for_loop :', timeit.timeit(method_for_loop, number = 1))

def method_list_comprehension():
    out = [d[i] for i in array]
    print('Check method_list_comprehension :', np.all(out == array/2)) # Just checking
print('Time method_list_comprehension :', timeit.timeit(method_list_comprehension, number = 1))

def method_bruteforce():
    idx = np.nonzero(from_values == array[:,None])[1]
    out = to_values[idx]
    print('Check method_bruteforce :', np.all(out == array/2)) # Just checking
print('Time method_bruteforce :', timeit.timeit(method_bruteforce, number = 1))

def method_searchsort():
    sort_idx = np.argsort(from_values)
    idx = np.searchsorted(from_values,array,sorter = sort_idx)
    out = to_values[sort_idx][idx]
    print('Check method_searchsort :', np.all(out == array/2)) # Just checking
print('Time method_searchsort :', timeit.timeit(method_searchsort, number = 1))

And I got the following results :

Check method_for_loop : True
Time method_for_loop : 2.6411612760275602

Check method_list_comprehension : True
Time method_list_comprehension : 0.07994363596662879

Check method_bruteforce : True
Time method_bruteforce : 11.960559037979692

Check method_searchsort : True
Time method_searchsort : 0.03770717792212963

The "searchsort" method is almost a hundred times faster than the "for" loop, and about 3600 times faster than the numpy bruteforce method. The list comprehension method is also a very good trade-off between code simplicity and speed.

Upvotes: 15

Eelco Hoogendoorn
Eelco Hoogendoorn

Reputation: 10769

The numpy_indexed package (disclaimer: I am its author) provides an elegant and efficient vectorized solution to this type of problem:

import numpy_indexed as npi
remapped_array = npi.remap(theArray, list(dict.keys()), list(dict.values()))

The method implemented is similar to the searchsorted based approach mentioned by Jean Lescut, but even more general. For instance, the items of the array do not need to be ints, but can be any type, even nd-subarrays themselves; yet it should achieve the same kind of performance.

Upvotes: 5

caiohamamura
caiohamamura

Reputation: 2758

Pythonic way without the need for data to be integer, can be even strings:

from scipy.stats import rankdata
import numpy as np

data = np.random.rand(100000)
replace = {data[0]: 1, data[5]: 8, data[8]: 10}

arr = np.vstack((replace.keys(), replace.values())).transpose()
arr = arr[arr[:,1].argsort()]

unique = np.unique(data)
mp = np.vstack((unique, unique)).transpose()
mp[np.in1d(mp[:,0], arr),1] = arr[:,1]
data = mp[rankdata(data, 'dense')-1][:,1]

Upvotes: 0

Pietro Battiston
Pietro Battiston

Reputation: 8390

No solution was posted still without a python loop on the array (except Celil's one, which however assume numbers are "small"), so here is an alternative:

def replace(arr, rep_dict):
    """Assumes all elements of "arr" are keys of rep_dict"""

    # Removing the explicit "list" breaks python3
    rep_keys, rep_vals = array(list(zip(*sorted(rep_dict.items()))))

    idces = digitize(arr, rep_keys, right=True)
    # Notice rep_keys[digitize(arr, rep_keys, right=True)] == arr

    return rep_vals[idces]

the way "idces" is created comes from here.

Upvotes: 4

Speckinius Flecksis
Speckinius Flecksis

Reputation: 624

Another more general way to achieve this is function vectorization:

import numpy as np

data = np.array([0, 1, 6, 5, 1, 2, 7, 6, 2, 3, 8, 7, 3, 4, 9, 8, 5, 6, 11, 10, 6, 7, 12, 11, 7, 8, 13, 12, 8, 9, 14, 13, 10, 11, 16, 15, 11, 12, 17, 16, 12, 13, 18, 17, 13, 14, 19, 18, 15, 16, 21, 20, 16, 17, 22, 21, 17, 18, 23, 22, 18, 19, 24, 23])
mapper_dict = {4: 0, 9: 5, 14: 10, 19: 15, 20: 0, 21: 1, 22: 2, 23: 3, 24: 0}

def mp(entry):
    return mapper_dict[entry] if entry in mapper_dict else entry
mp = np.vectorize(mp)

print mp(data)

Upvotes: 10

Katriel
Katriel

Reputation: 123782

Well, you need to make one pass through theArray, and for each element replace it if it is in the dictionary.

for i in xrange( len( theArray ) ):
    if foo[ i ] in dict:
        foo[ i ] = dict[ foo[ i ] ]

Upvotes: 0

kennytm
kennytm

Reputation: 523784

I believe there's even more efficient method, but for now, try

from numpy import copy

newArray = copy(theArray)
for k, v in d.iteritems(): newArray[theArray==k] = v

Microbenchmark and test for correctness:

#!/usr/bin/env python2.7

from numpy import copy, random, arange

random.seed(0)
data = random.randint(30, size=10**5)

d = {4: 0, 9: 5, 14: 10, 19: 15, 20: 0, 21: 1, 22: 2, 23: 3, 24: 0}
dk = d.keys()
dv = d.values()

def f1(a, d):
    b = copy(a)
    for k, v in d.iteritems():
        b[a==k] = v
    return b

def f2(a, d):
    for i in xrange(len(a)):
        a[i] = d.get(a[i], a[i])
    return a

def f3(a, dk, dv):
    mp = arange(0, max(a)+1)
    mp[dk] = dv
    return mp[a]


a = copy(data)
res = f2(a, d)

assert (f1(data, d) == res).all()
assert (f3(data, dk, dv) == res).all()

Result:

$ python2.7 -m timeit -s 'from w import f1,f3,data,d,dk,dv' 'f1(data,d)'
100 loops, best of 3: 6.15 msec per loop

$ python2.7 -m timeit -s 'from w import f1,f3,data,d,dk,dv' 'f3(data,dk,dv)'
100 loops, best of 3: 19.6 msec per loop

Upvotes: 49

D R
D R

Reputation: 22510

Assuming the values are between 0 and some maximum integer, one could implement a fast replace by using the numpy-array as int->int dict, like below

mp = numpy.arange(0,max(data)+1)
mp[replace.keys()] = replace.values()
data = mp[data]

where first

data = [ 0  1  6  5  1  2  7  6  2  3  8  7  3  4  9  8  5  6 11 10  6  7 12 11  7
  8 13 12  8  9 14 13 10 11 16 15 11 12 17 16 12 13 18 17 13 14 19 18 15 16
 21 20 16 17 22 21 17 18 23 22 18 19 24 23]

and replacing with

replace = {4: 0, 9: 5, 14: 10, 19: 15, 20: 0, 21: 1, 22: 2, 23: 3, 24: 0}

we obtain

data = [ 0  1  6  5  1  2  7  6  2  3  8  7  3  0  5  8  5  6 11 10  6  7 12 11  7
  8 13 12  8  5 10 13 10 11 16 15 11 12 17 16 12 13 18 17 13 10 15 18 15 16
  1  0 16 17  2  1 17 18  3  2 18 15  0  3]

Upvotes: 29

John La Rooy
John La Rooy

Reputation: 304493

for i in xrange(len(the_array)):
    the_array[i] = the_dict.get(the_array[i], the_array[i])

Upvotes: 0

Related Questions