cs95

Reputation: 402483

Unprecedented TypeError using groupby and mean of vectors in Pandas

This is a sample data frame:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Cat' : ['a', 'a', 'b'], 'Vec' : [[1, 2, 3], [4, 5, 6], [1, 2, 3]]})

print(df)
  Cat        Vec
0   a  [1, 2, 3]
1   a  [4, 5, 6]
2   b  [1, 2, 3]

My goal is to group by Cat and take the element-wise mean of these vectors (that is, along axis 0):

                 Vec
Cat                 
a    [2.5, 3.5, 4.5]
b    [1.0, 2.0, 3.0]

The obvious first attempt was:

df.groupby('Cat').Vec.apply(np.mean)

But this gives me:

TypeError: Could not convert [1, 2, 3, 4, 5, 6] to numeric

However, this works:

df.groupby('Cat').Vec.apply(lambda x: np.mean(x.tolist(), axis=0))

Also, this same technique works to good effect in this answer: https://stackoverflow.com/a/45726608/4909087

It seems a bit roundabout. Why does the error occur with the first method and how do I fix that?

Upvotes: 3

Views: 227

Answers (1)

piRSquared

Reputation: 294258

If the vectors are stored as NumPy arrays rather than lists, your first approach works:

df = pd.DataFrame({
    'Cat': ['a', 'a', 'b'],
    'Vec': [np.array([1, 2, 3]), np.array([4, 5, 6]), np.array([1, 2, 3])]
})


df.groupby('Cat').Vec.apply(np.mean)

Cat
a    [2.5, 3.5, 4.5]
b    [1.0, 2.0, 3.0]
Name: Vec, dtype: object

If the column holds lists, convert each one to an array before grouping:

df = pd.DataFrame({
    'Cat': ['a', 'a', 'b'],
    'Vec': [[1, 2, 3], [4, 5, 6], [1, 2, 3]]
})

df.Vec.apply(np.array).groupby(df.Cat).apply(np.mean)

Cat
a    [2.5, 3.5, 4.5]
b    [1.0, 2.0, 3.0]
Name: Vec, dtype: object
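A further option, not part of the original answer and offered only as a sketch: assuming every vector has the same length, the lists can be expanded into numeric columns so that the built-in groupby mean applies. The names `expanded` and `means` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Cat': ['a', 'a', 'b'],
    'Vec': [[1, 2, 3], [4, 5, 6], [1, 2, 3]]
})

# Expand each list into its own numeric columns, indexed by category
expanded = pd.DataFrame(df['Vec'].tolist(), index=df['Cat'])

# Group on the index level and average each column
means = expanded.groupby(level=0).mean()
print(means)
```

The result is a DataFrame with one column per vector component, not a Series of arrays.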

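The issue is that np.mean can take a list of lists, but not an object array of lists. A list of lists is converted to a 2-D numeric array, so the mean along axis 0 is well defined. The values of the Vec column, however, form a 1-D object array whose elements are Python lists. Summing those elements concatenates the lists (hence the [1, 2, 3, 4, 5, 6] in your error message), and the resulting list cannot be divided by the count.

See these examples: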

np.mean(df.loc[df.Cat.eq('a'), 'Vec'].values, 0)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-380-279352aca85f> in <module>()
----> 1 np.mean(df.loc[df.Cat.eq('a'), 'Vec'].values, 0)

//anaconda/envs/3.6/lib/python3.6/site-packages/numpy/core/fromnumeric.py in mean(a, axis, dtype, out, keepdims)
   2907 
   2908     return _methods._mean(a, axis=axis, dtype=dtype,
-> 2909                           out=out, **kwargs)
   2910 
   2911 

//anaconda/envs/3.6/lib/python3.6/site-packages/numpy/core/_methods.py in _mean(a, axis, dtype, out, keepdims)
     80             ret = ret.dtype.type(ret / rcount)
     81     else:
---> 82         ret = ret / rcount
     83 
     84     return ret

TypeError: unsupported operand type(s) for /: 'list' and 'int'

np.mean(df.loc[df.Cat.eq('a'), 'Vec'].values.tolist(), 0)

array([ 2.5,  3.5,  4.5])
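The same difference can be seen without pandas. The sketch below builds a 1-D object array by hand (the names `obj`, `summed`, and `stacked` are illustrative, not from the question) and shows that summing it concatenates the lists, while converting to a list of lists first yields a proper 2-D array.

```python
import numpy as np

# 1-D object array whose two elements are Python lists,
# similar to what .values returns for the Vec column
obj = np.empty(2, dtype=object)
obj[0] = [1, 2, 3]
obj[1] = [4, 5, 6]

# Reducing an object array with add uses Python's +,
# which concatenates lists instead of adding element-wise
summed = np.add.reduce(obj)
print(summed)

# A list of lists becomes a 2-D numeric array, so axis=0 works
stacked = np.asarray(obj.tolist())
print(stacked.shape)
print(stacked.mean(axis=0))
```

The concatenated list is what the mean computation then tries to divide by the number of rows, which is where the TypeError in the traceback above comes from.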

Upvotes: 3
