Peter
Peter

Reputation: 620

Pandas DataFrame with X,Y coordinates to NumPy matrix

I have a DataFrame with columns X, Y and value, e.g.:

   X |   Y | value
------------------
   1 |   1 |    56
   2 |   1 |    13
   3 |   1 |    25
 ... | ... |   ...
   1 |   2 |     7
   2 |   2 |    18
 ... | ... |   ...
   1 | 123 |    91
 ... | ... |   ...
  50 | 123 |    32

I need to convert this to DataFrame to a NumPy matrix:

[[56, 13, 25, ...],
 [ 7, 18,     ...],
 ...,
 [ 91, ...   , 32]]

I know I can iterate over each cell of the DataFrame, but that is too slow. What is the effective way of doing this?

Also note: values for some coordinates in DataFrame are missing

Upvotes: 4

Views: 9274

Answers (3)

jeremycg
jeremycg

Reputation: 24945

I would do this by going through a sparse coordinate matrix, which is basically the format you have.

NB, missing spots will be stored as 0s if you convert to an array.

If you have a ton missing, it might be better to stick to a sparse matrix for memory or performance reasons depending on your downstream processes.

x = pd.DataFrame({'X':[1,2,3,1,2,1,4], 'Y':[1,1,1,2,2,3,3], 'Z':[56,13,25,7,18,91,32]})

#import coo from sparse
from scipy.sparse import coo_matrix
#it works like (data,(y,x))
out = coo_matrix((x.Z,(x.Y-1,x.X-1))) #-1, as you aren't 0 indexed above
#if you really don't want sparse turn it to an array:
out.toarray()
array([[56, 13, 25,  0],
       [ 7, 18,  0,  0],
       [91,  0,  0, 32]], dtype=int64)

Upvotes: 4

Zero
Zero

Reputation: 76917

Using set_index

In [501]: df.set_index(['Y', 'X']).unstack().values
Out[501]:
array([[ 56.,  13.,  25.,  nan],
       [  7.,  18.,  nan,  nan],
       [ 91.,  nan,  nan,  32.]])

Or, Using groupby

In [493]: df.groupby(['Y', 'X'])['value'].sum().unstack().values
Out[493]:
array([[ 56.,  13.,  25.,  nan],
       [  7.,  18.,  nan,  nan],
       [ 91.,  nan,  nan,  32.]])

Or, Using crosstab

In [500]: pd.crosstab(index=df.Y, columns=df.X, values=df.value, aggfunc='sum').values
Out[500]:
array([[ 56.,  13.,  25.,  nan],
       [  7.,  18.,  nan,  nan],
       [ 91.,  nan,  nan,  32.]])

Or, using pd.pivot_table as pointed in another answer.

Upvotes: 4

akuiper
akuiper

Reputation: 214927

Pivot the data frame and the values should be what you need:

df.pivot('Y', 'X', 'value').values

#array([[ 56.,  13.,  25.,  nan],
#       [  7.,  18.,  nan,  nan],
#       [ 91.,  nan,  nan,  32.]])

Upvotes: 11

Related Questions