Reputation: 620
I have a DataFrame with columns X, Y and value, e.g.:
 X  |  Y  | value
----|-----|------
 1  |  1  |  56
 2  |  1  |  13
 3  |  1  |  25
... | ... |  ...
 1  |  2  |   7
 2  |  2  |  18
... | ... |  ...
 1  | 123 |  91
... | ... |  ...
 50 | 123 |  32
I need to convert this DataFrame to a NumPy matrix:
[[56, 13, 25, ...],
 [ 7, 18, ...],
 ...,
 [91, ..., 32]]
I know I can iterate over each cell of the DataFrame, but that is too slow. What is an efficient way of doing this?
Also note: values for some coordinates in the DataFrame are missing.
Upvotes: 4
Views: 9274
Reputation: 24945
I would do this by going through a sparse coordinate matrix, which is essentially the format you already have.
NB: missing spots will be stored as 0s if you convert to an array.
If a lot of values are missing, it might be better to stick with a sparse matrix for memory or performance reasons, depending on your downstream processing.
import pandas as pd
from scipy.sparse import coo_matrix

x = pd.DataFrame({'X': [1, 2, 3, 1, 2, 1, 4],
                  'Y': [1, 1, 1, 2, 2, 3, 3],
                  'Z': [56, 13, 25, 7, 18, 91, 32]})

# coo_matrix takes (data, (row, col)); subtract 1 because X/Y are 1-indexed
out = coo_matrix((x.Z, (x.Y - 1, x.X - 1)))

# if you really don't want a sparse matrix, convert it to a dense array:
out.toarray()
array([[56, 13, 25, 0],
[ 7, 18, 0, 0],
[91, 0, 0, 32]], dtype=int64)
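If the missing coordinates need to stay distinguishable from genuine zeros after densifying, one possible sketch (reusing the same sample frame as above, and a hypothetical `mask`/`dense` naming) is to build a boolean occupancy matrix the same way and overwrite the untouched positions with NaN:

```python
import numpy as np
import pandas as pd
from scipy.sparse import coo_matrix

x = pd.DataFrame({'X': [1, 2, 3, 1, 2, 1, 4],
                  'Y': [1, 1, 1, 2, 2, 3, 3],
                  'Z': [56, 13, 25, 7, 18, 91, 32]})

# dense array of values; absent coordinates come out as 0
dense = coo_matrix((x.Z, (x.Y - 1, x.X - 1))).toarray().astype(float)

# boolean matrix marking which coordinates actually appeared in the frame
mask = coo_matrix((np.ones(len(x), dtype=bool),
                   (x.Y - 1, x.X - 1))).toarray()

# positions never seen in the data become NaN instead of 0
dense[~mask] = np.nan
```

This keeps the fast sparse construction while matching the NaN convention of the pandas-based answers below.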
Upvotes: 4
Reputation: 76917
Using set_index
In [501]: df.set_index(['Y', 'X']).unstack().values
Out[501]:
array([[ 56., 13., 25., nan],
[ 7., 18., nan, nan],
[ 91., nan, nan, 32.]])
Or, Using groupby
In [493]: df.groupby(['Y', 'X'])['value'].sum().unstack().values
Out[493]:
array([[ 56., 13., 25., nan],
[ 7., 18., nan, nan],
[ 91., nan, nan, 32.]])
Or, Using crosstab
In [500]: pd.crosstab(index=df.Y, columns=df.X, values=df.value, aggfunc='sum').values
Out[500]:
array([[ 56., 13., 25., nan],
[ 7., 18., nan, nan],
[ 91., nan, nan, 32.]])
Or, using pd.pivot_table, as pointed out in another answer.
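A minimal sketch of that variant, assuming the same sample data as the approaches above:

```python
import pandas as pd

df = pd.DataFrame({'X': [1, 2, 3, 1, 2, 1, 4],
                   'Y': [1, 1, 1, 2, 2, 3, 3],
                   'value': [56, 13, 25, 7, 18, 91, 32]})

# pivot Y into rows and X into columns; absent cells become NaN
out = pd.pivot_table(df, index='Y', columns='X',
                     values='value', aggfunc='sum').values
```

Like groupby/crosstab, this aggregates duplicates with the chosen aggfunc, so it is safe even if an (X, Y) pair appears more than once.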
Upvotes: 4