Reputation: 33
I have a dataframe df containing the columns x, y (both starting at 0) and several value columns. The x and y coordinates are not complete: many x-y combinations are missing, and sometimes entire x or y values are absent altogether. I would like to create a 2-d numpy array holding the complete matrix of shape (df.x.max() + 1, df.y.max() + 1), with missing values replaced by np.nan. pd.pivot already comes quite close, but it does not fill in completely missing x/y values.
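To illustrate the desired result on a small, made-up example (hypothetical data, not my real frame):

import numpy as np
import pandas as pd

# x == 1 and y == 2 are missing entirely; most x-y pairs are also absent
df = pd.DataFrame({'x': [0, 0, 2], 'y': [0, 3, 1], 'value': [1.0, 2.0, 3.0]})

# Desired (3, 4)-shaped output:
# [[ 1. nan nan  2.]
#  [nan nan nan nan]
#  [nan  3. nan nan]]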
The following code already achieves what is needed, but due to the for loop it is rather slow:
img = np.full((df.x.max() + 1, df.y.max() + 1), np.nan)
col = 'value'
for ind, line in df.iterrows():
    img[line.x, line.y] = line[col]
A significantly faster version goes as follows:
ind = pd.MultiIndex.from_product((range(df.x.max() + 1), range(df.y.max() + 1)), names=['x', 'y'])
s_img = pd.Series([np.nan] * len(ind), index=ind, name='value')
temp = df.set_index(['x', 'y'])['value']
s_img.loc[temp.index] = temp
img = s_img.unstack().values
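For completeness, the same MultiIndex idea can be written more compactly with reindex (still pandas-based, so I would expect similar timings):

ind = pd.MultiIndex.from_product((range(df.x.max() + 1), range(df.y.max() + 1)), names=['x', 'y'])
img = df.set_index(['x', 'y'])['value'].reindex(ind).unstack().values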
The question is whether a vectorized method exists that would make the code even shorter and faster.
Thanks for any hints in advance!
Upvotes: 2
Views: 759
Reputation: 879849
Often the fastest way to populate a NumPy array is simply to allocate the array first and then assign values to it with a vectorized operation. In this case np.put seems ideal, since it lets you assign values using a (flat) array of indices and an array of values.
nrows, ncols = df['x'].max() + 1, df['y'].max() + 1
img = np.full((nrows, ncols), np.nan)   # allocate the full NaN matrix
ind = df['x'] * ncols + df['y']         # flat (row-major) indices
np.put(img, ind, df['value'])           # vectorized assignment
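If you would rather not spell out the row-major arithmetic yourself, np.ravel_multi_index computes the same flat indices (a minor variant, not required for the approach):

# Equivalent flat indices computed by NumPy's multi-index helper
ind = np.ravel_multi_index((df['x'], df['y']), (nrows, ncols))
np.put(img, ind, df['value'])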
Here is a benchmark which shows that using np.put can be 82x faster than alt (the unstacking method) for producing a (100, 100)-shaped result array:
In [184]: df = make_df(100,100)
In [185]: %timeit orig(df)
161 ms ± 753 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [186]: %timeit alt(df)
31.2 ms ± 235 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [187]: %timeit using_put(df)
378 µs ± 1.56 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [188]: 31200/378
Out[188]: 82.53968253968254
This is the setup used for the benchmark:
import numpy as np
import pandas as pd
def make_df(nrows, ncols):
    # Build a full grid of values, then drop half the rows and columns at random
    df = pd.DataFrame(np.arange(nrows * ncols).reshape(nrows, ncols))
    df.index.name = 'x'
    df.columns.name = 'y'
    ind_x = np.random.choice(np.arange(nrows), replace=False, size=nrows // 2)
    ind_y = np.random.choice(np.arange(ncols), replace=False, size=ncols // 2)
    df = df.drop(ind_x, axis=0).drop(ind_y, axis=1).stack().reset_index().rename(columns={0: 'value'})
    return df
def orig(df):
    img = np.full((df.x.max() + 1, df.y.max() + 1), np.nan)
    col = 'value'
    for ind, line in df.iterrows():
        img[line.x, line.y] = line[col]
    return img
def alt(df):
    ind = pd.MultiIndex.from_product((range(df.x.max() + 1), range(df.y.max() + 1)), names=['x', 'y'])
    s_img = pd.Series([np.nan] * len(ind), index=ind, name='value')
    temp = df.set_index(['x', 'y'])['value']
    s_img.loc[temp.index] = temp
    img = s_img.unstack().values
    return img
def using_put(df):
    nrows, ncols = df['x'].max() + 1, df['y'].max() + 1
    img = np.full((nrows, ncols), np.nan)
    ind = df['x'] * ncols + df['y']
    np.put(img, ind, df['value'])
    return img
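As a quick sanity check (not part of the benchmark numbers above), the three implementations can be compared directly; np.testing.assert_array_equal treats NaNs in matching positions as equal:

df = make_df(10, 10)
expected = orig(df)
np.testing.assert_array_equal(alt(df), expected)
np.testing.assert_array_equal(using_put(df), expected)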
Alternatively, since your DataFrame is sparse, you might be interested in creating a sparse matrix:
import scipy.sparse as sparse
def using_coo(df):
    nrows, ncols = df['x'].max() + 1, df['y'].max() + 1
    # Missing entries are simply not stored in the COO representation
    result = sparse.coo_matrix(
        (df['value'], (df['x'], df['y'])), shape=(nrows, ncols), dtype='float64')
    return result
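Note that result.toarray() would fill the unstored positions with 0 rather than np.nan. If you later need the NaN-filled dense array back, here is a small sketch of the conversion (using the COO attributes row, col and data):

def coo_to_nan_dense(coo):
    # Start from an all-NaN array and fill in only the stored entries
    img = np.full(coo.shape, np.nan)
    img[coo.row, coo.col] = coo.data
    return img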
As one would expect, making sparse matrices (from sparse data) is even faster (and requires less memory) than creating dense NumPy arrays:
In [237]: df = make_df(100,100)
In [238]: %timeit using_put(df)
381 µs ± 2.63 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [239]: %timeit using_coo(df)
196 µs ± 1.26 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [240]: 381/196
Out[240]: 1.9438775510204083
Upvotes: 3