user3058134
user3058134

Reputation: 21

python pandas aggregating non numeric types

I have the following problem, I managed to create a data frame with object dtypes on some columns. In particular these would be 2d numpy arrays but they could be any non-numeric type. Now I want to pivot my dataframe. Is there a way to pass an aggregating function of my choice which works on these objects? I don't seem to be able to do it and I get the error:

GroupByError: No numeric types to aggregate

For example, say I have this dummy data frame:

date foo  bar               mat
1     a   x      [[1, 2], [3, 4]]
1     b   x      [[1, 2], [3, 4]]
1     a   y      [[1, 2], [3, 4]]
1     b   y      [[1, 2], [3, 4]]
2     a   x      [[4, 5], [6, 7]]
2     b   x      [[4, 5], [6, 7]]
2     a   y      [[4, 5], [6, 7]]
2     b   y      [[4, 5], [6, 7]]

and I want to have a new data frame of the type:

dd.pivot_table(values=['mat'], rows=['date'], cols=['foo'], aggfunc= ??)

where my 2-d arrays will be an element-by-element sum of the arrays with same value in the 'foo' columns. How can I do that? If not possible, is it possible to pick the first occurrence of the 'mat' element in the list of arrays with same 'foo'? Thanks

added the desired output:

date    a               b           
1    [[2, 4], [6, 8]]  [[8, 10], [12, 14]]
2    [[2, 4], [6, 8]]  [[8, 10], [12, 14]]

Upvotes: 1

Views: 1864

Answers (1)

roman
roman

Reputation: 117626

You can group first and then pivot:

>>> grouped = df.groupby(('foo', 'date'))
>>> g = grouped['mat'].apply(lambda x: np.array(map(np.array, x.values)).T.sum(axis=2).T).reset_index()
>>> g
  foo  date                    0
0   a     1     [[2, 4], [6, 8]]
1   a     2  [[8, 10], [12, 14]]
2   b     1     [[2, 4], [6, 8]]
3   b     2  [[8, 10], [12, 14]]
>>> g.pivot(columns='foo', values=0, index='date').reset_index()
foo  date                    a                    b
0       1     [[2, 4], [6, 8]]     [[2, 4], [6, 8]]
1       2  [[8, 10], [12, 14]]  [[8, 10], [12, 14]]

To sum by elements I've used numpy sum over axis=2 (converted lists into np.array beforehand). Also, seeems that your output is a bit incorrect - it should be:

date    a               b           
1    [[2, 4], [6, 8]]  [[8, 10], [12, 14]]
2    [[2, 4], [6, 8]]  [[8, 10], [12, 14]]

Upvotes: 1

Related Questions