Using a pandas DataFrame to set indices in a NumPy array

Question

I have a pandas DataFrame whose rows hold (i, j) indices into a NumPy array. The array has to be set to 1 at those indices. I need to do this millions of times on a large NumPy array. Is there a more efficient way than the approach shown below?
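To make the operation concrete, here is a minimal sketch on a tiny array. The shape, the index values, and the names `ind_small` and `a_small` are made up for illustration and are not part of the benchmark.

```python
import numpy as np
import pandas as pd

# Hypothetical small example: a 3x4 array and three (i, j) index pairs.
ind_small = pd.DataFrame({"i": [0, 1, 2], "j": [3, 0, 2]})
a_small = np.zeros((3, 4), dtype=np.uint)

# Integer-array (fancy) indexing sets one element per DataFrame row.
a_small[ind_small["i"].to_numpy(), ind_small["j"].to_numpy()] = 1
print(a_small)
```

Each row of the DataFrame selects exactly one element, so three elements end up set to 1 and the rest stay 0.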

from numpy import float32, uint
from numpy.random import choice
from pandas import DataFrame
from timeit import timeit

xy = 2000,300000
sz = 10000000
ind = DataFrame({"i":choice(range(xy[0]),sz),"j":choice(range(xy[1]),sz)}).drop_duplicates()
dtype = uint
repeats = 10

#original (~21s)
stmt = '''\
from numpy import zeros
a = zeros(xy, dtype=dtype)
a[ind.values[:,0],ind.values[:,1]] = 1'''

print(timeit(stmt, "from __main__ import xy,sz,ind,dtype", number=repeats))

#suggested by @piRSquared (~13s)
#shape=xy is passed explicitly; otherwise the shape is inferred from the largest indices
stmt = '''\
from numpy import ones
from scipy.sparse import coo_matrix
i,j = ind.i.values,ind.j.values
a = coo_matrix((ones(i.size, dtype=dtype), (i, j)), shape=xy, dtype=dtype).toarray()
'''

print(timeit(stmt, "from __main__ import xy,sz,ind,dtype", number=repeats))

I have edited the post above to include the approach suggested by @piRSquared and rewritten the benchmark to allow an apples-to-apples comparison. For both data types I tried (uint and float32), the suggested approach reduces the run time by about 40% (roughly 21 s versus 13 s in the timings above).
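As a sanity check, the sketch below compares the two approaches on a small array and confirms they produce the same result. The shape, seed, and variable names are assumptions chosen for illustration. Note that `coo_matrix(...).toarray()` sums entries that share the same (i, j), so the two approaches agree only because `drop_duplicates()` has removed repeated pairs.

```python
import numpy as np
import pandas as pd
from scipy.sparse import coo_matrix

shape = (20, 30)  # illustrative; far smaller than the benchmark array
rng = np.random.default_rng(0)
pairs = pd.DataFrame({
    "i": rng.integers(0, shape[0], 100),
    "j": rng.integers(0, shape[1], 100),
}).drop_duplicates()
i, j = pairs["i"].to_numpy(), pairs["j"].to_numpy()

# Approach 1: allocate zeros, then assign with integer-array indexing.
dense = np.zeros(shape, dtype=np.uint)
dense[i, j] = 1

# Approach 2: build a COO sparse matrix and densify it.
# shape= is given explicitly so the result does not depend on the largest index.
via_coo = coo_matrix(
    (np.ones(i.size, dtype=np.uint), (i, j)), shape=shape, dtype=np.uint
).toarray()

same = np.array_equal(dense, via_coo)
print(same)
```

Without `shape=`, the sparse result can come out smaller than the target array whenever the last row or column happens to contain no index.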
