dofine
dofine

Reputation: 883

How to count values greater than the group mean in Pandas?

The dataframe is generated as follows. What's the best way to count how many values are greater than the average of grouped by ["A", "B"]?

import numpy as np
import pandas as pd 
keys = np.array([['A', 'B'], ['A', 'B'], ['A', 'B'],
                 ['A', 'B'], ['C', 'D'], ['C', 'D'],
                 ['C', 'D'], ['E', 'F'], ['E', 'F'],
                 ['G', 'H']]) 
df = (pd.DataFrame(np.hstack([keys,np.random.randn(10,4).round(2)]),
                  columns = ['col1', 'col2', 'col3',
                             'col4', 'col5', 'col6'])
      .astype({'col3': float,
               'col4': float,
               'col5': float, 
               'col6': float}))

I tried to first count the mean and then join the grouped-dataframe back to the original one, then do a sum(1). But that appears to be a little tedious.

df2 = pd.merge(df.groupby(["col1", "col2"]).mean(), df, left_on=["col1", "col2"], right_on=["col1", "col2"])

Upvotes: 0

Views: 3168

Answers (3)

Space Impact
Space Impact

Reputation: 13255

You can do:

(df[['col3', 'col4', 'col5', 'col6']]>df.groupby(['col1','col2']).transform('mean')).sum()

col3    4
col4    6
col5    3
col6    6
dtype: int64

Upvotes: 2

CT Zhu
CT Zhu

Reputation: 54380

You would need the agg method

In [28]: df.groupby(['col1', 'col2']).agg(lambda x: (x > x.mean()).sum())
Out[28]:
           col3  col4  col5  col6
col1 col2
A    B      1.0   2.0   2.0   2.0
C    D      2.0   2.0   2.0   2.0
E    F      1.0   1.0   1.0   1.0
G    H      0.0   0.0   0.0   0.0

In essence, the x is going to be array-like. x > x.mean() gives True if the element is larger than the mean and 0 otherwise, sum then counts the number of Trues.

Upvotes: 2

piRSquared
piRSquared

Reputation: 294508

numpy.add.at and pandas.factorize

cols = ['col1', 'col2']
i, r = pd.factorize([*zip(*map(df.get, cols))])

v = df.drop(cols, 1).values
n, m = shape = len(r), v.shape[1]

b = np.zeros(shape)
c = np.zeros(shape)
d = np.zeros(shape, np.int64)

i0, i1 = i.repeat(m), np.tile(np.arange(m), len(v))

np.add.at(b, (i0, i1), v.ravel())
np.add.at(c, (i0, i1), 1)
np.add.at(d, (i0, i1), (v > (b / c)[i]).ravel())

pd.DataFrame(
    d, pd.MultiIndex.from_tuples(r, names=cols),
    df.columns.difference(cols)
)

           col3  col4  col5  col6
col1 col2                        
A    B        2     2     3     2
C    D        2     1     1     2
E    F        1     1     1     1
G    H        0     0     0     0

Upvotes: 1

Related Questions