Reputation: 883

How to count values greater than the group mean in Pandas?

The dataframe is generated as follows. What's the best way to count how many values are greater than the average of grouped by ["A", "B"]?

import numpy as np
import pandas as pd 
keys = np.array([['A', 'B'], ['A', 'B'], ['A', 'B'],
                 ['A', 'B'], ['C', 'D'], ['C', 'D'],
                 ['C', 'D'], ['E', 'F'], ['E', 'F'],
                 ['G', 'H']]) 
df = (pd.DataFrame(np.hstack([keys,np.random.randn(10,4).round(2)]),
                  columns = ['col1', 'col2', 'col3',
                             'col4', 'col5', 'col6'])
      .astype({'col3': float,
               'col4': float,
               'col5': float, 
               'col6': float}))

I tried to first count the mean and then join the grouped-dataframe back to the original one, then do a sum(1). But that appears to be a little tedious.

df2 = pd.merge(df.groupby(["col1", "col2"]).mean(), df, left_on=["col1", "col2"], right_on=["col1", "col2"])

Upvotes: 0

Answers (3)

Space Impact

Reputation: 13255

You can do:

(df[['col3', 'col4', 'col5', 'col6']]>df.groupby(['col1','col2']).transform('mean')).sum()

col3    4
col4    6
col5    3
col6    6
dtype: int64

Upvotes: 2

CT Zhu

Reputation: 54380

You would need the agg method

In [28]: df.groupby(['col1', 'col2']).agg(lambda x: (x > x.mean()).sum())
Out[28]:
           col3  col4  col5  col6
col1 col2
A    B      1.0   2.0   2.0   2.0
C    D      2.0   2.0   2.0   2.0
E    F      1.0   1.0   1.0   1.0
G    H      0.0   0.0   0.0   0.0

In essence, the x is going to be array-like. x > x.mean() gives True if the element is larger than the mean and 0 otherwise, sum then counts the number of Trues.

Upvotes: 2

piRSquared

Reputation: 294508

`numpy.add.at` and `pandas.factorize`

cols = ['col1', 'col2']
i, r = pd.factorize([*zip(*map(df.get, cols))])

v = df.drop(cols, 1).values
n, m = shape = len(r), v.shape[1]

b = np.zeros(shape)
c = np.zeros(shape)
d = np.zeros(shape, np.int64)

i0, i1 = i.repeat(m), np.tile(np.arange(m), len(v))

np.add.at(b, (i0, i1), v.ravel())
np.add.at(c, (i0, i1), 1)
np.add.at(d, (i0, i1), (v > (b / c)[i]).ravel())

pd.DataFrame(
    d, pd.MultiIndex.from_tuples(r, names=cols),
    df.columns.difference(cols)
)

           col3  col4  col5  col6
col1 col2                        
A    B        2     2     3     2
C    D        2     1     1     2
E    F        1     1     1     1
G    H        0     0     0     0

Upvotes: 1

How to count values greater than the group mean in Pandas?

Answers (3)

numpy.add.at and pandas.factorize

Related Questions

`numpy.add.at` and `pandas.factorize`