Reputation: 883
The dataframe is generated as follows. What's the best way to count, per column, how many values are greater than the mean of their group, where the groups are given by ["col1", "col2"]?
import numpy as np
import pandas as pd
keys = np.array([['A', 'B'], ['A', 'B'], ['A', 'B'],
                 ['A', 'B'], ['C', 'D'], ['C', 'D'],
                 ['C', 'D'], ['E', 'F'], ['E', 'F'],
                 ['G', 'H']])
df = (pd.DataFrame(np.hstack([keys, np.random.randn(10, 4).round(2)]),
                   columns=['col1', 'col2', 'col3',
                            'col4', 'col5', 'col6'])
        .astype({'col3': float,
                 'col4': float,
                 'col5': float,
                 'col6': float}))
I tried to first compute the group means and then merge the grouped dataframe back onto the original one, then do a sum(1). But that seems a little tedious.
df2 = pd.merge(df.groupby(["col1", "col2"]).mean(), df, left_on=["col1", "col2"], right_on=["col1", "col2"])
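For reference, here is one way the merge route could be finished (a sketch of my attempt completed; the _mean suffix and the value_cols name are my own choices, not established convention):

```python
import numpy as np
import pandas as pd

# Rebuild the example frame (values are random, only the shape is fixed)
keys = np.array([['A', 'B']] * 4 + [['C', 'D']] * 3 + [['E', 'F']] * 2 + [['G', 'H']])
df = (pd.DataFrame(np.hstack([keys, np.random.randn(10, 4).round(2)]),
                   columns=['col1', 'col2', 'col3', 'col4', 'col5', 'col6'])
        .astype({'col3': float, 'col4': float, 'col5': float, 'col6': float}))

value_cols = ['col3', 'col4', 'col5', 'col6']

# Per-group means, merged back onto the original rows; the suffix
# distinguishes the mean columns from the raw ones
means = df.groupby(['col1', 'col2'], as_index=False)[value_cols].mean()
merged = df.merge(means, on=['col1', 'col2'], suffixes=('', '_mean'))

# Compare each value with its group's mean and count True per column
counts = pd.Series(
    (merged[value_cols].to_numpy()
     > merged[[c + '_mean' for c in value_cols]].to_numpy()).sum(axis=0),
    index=value_cols)
```

The inner merge preserves the order of the left frame's keys, so the comparison lines up row by row.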
Upvotes: 0
Views: 3168
Reputation: 13255
You can do:
(df[['col3', 'col4', 'col5', 'col6']] > df.groupby(['col1', 'col2']).transform('mean')).sum()
col3 4
col4 6
col5 3
col6 6
dtype: int64
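If you want the counts broken out per group rather than summed over the whole column, the same boolean mask can itself be grouped. This is a variant sketch of the transform idea, not part of the original answer; mask and per_group are my own names:

```python
import numpy as np
import pandas as pd

# Rebuild the example frame (random values, fixed shape)
keys = np.array([['A', 'B']] * 4 + [['C', 'D']] * 3 + [['E', 'F']] * 2 + [['G', 'H']])
df = (pd.DataFrame(np.hstack([keys, np.random.randn(10, 4).round(2)]),
                   columns=['col1', 'col2', 'col3', 'col4', 'col5', 'col6'])
        .astype({'col3': float, 'col4': float, 'col5': float, 'col6': float}))

value_cols = ['col3', 'col4', 'col5', 'col6']

# True where a value exceeds its group's mean
mask = df[value_cols].gt(df.groupby(['col1', 'col2'])[value_cols].transform('mean'))

# Summing the mask within each group counts the Trues per group
per_group = mask.groupby([df['col1'], df['col2']]).sum()
```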
Upvotes: 2
Reputation: 54380
You can use the agg method:
In [28]: df.groupby(['col1', 'col2']).agg(lambda x: (x > x.mean()).sum())
Out[28]:
col3 col4 col5 col6
col1 col2
A B 1.0 2.0 2.0 2.0
C D 2.0 2.0 2.0 2.0
E F 1.0 1.0 1.0 1.0
G H 0.0 0.0 0.0 0.0
In essence, x is array-like. x > x.mean() gives True where the element is larger than the mean and False otherwise; sum then counts the number of Trues.
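On a toy Series the lambda works like this (x is a hypothetical example, not from the question's data):

```python
import pandas as pd

x = pd.Series([1, 2, 3, 10])   # mean is 4.0
# Booleans behave as 0/1 under sum, so this counts values above the mean
print((x > x.mean()).sum())    # -> 1: only 10 exceeds 4.0
```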
Upvotes: 2
Reputation: 294508
numpy.add.at and pandas.factorize
cols = ['col1', 'col2']
i, r = pd.factorize([*zip(*map(df.get, cols))])
v = df.drop(columns=cols).values
n, m = shape = len(r), v.shape[1]
b = np.zeros(shape)
c = np.zeros(shape)
d = np.zeros(shape, np.int64)
i0, i1 = i.repeat(m), np.tile(np.arange(m), len(v))
np.add.at(b, (i0, i1), v.ravel())
np.add.at(c, (i0, i1), 1)
np.add.at(d, (i0, i1), (v > (b / c)[i]).ravel())
pd.DataFrame(
d, pd.MultiIndex.from_tuples(r, names=cols),
df.columns.difference(cols)
)
col3 col4 col5 col6
col1 col2
A B 2 2 3 2
C D 2 1 1 2
E F 1 1 1 1
G H 0 0 0 0
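Put together as one runnable block with a seeded frame, and cross-checked against the plain groupby route (the seed and the final assert are my own additions for verification, not part of the original answer):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
keys = np.array([['A', 'B']] * 4 + [['C', 'D']] * 3 + [['E', 'F']] * 2 + [['G', 'H']])
df = (pd.DataFrame(np.hstack([keys, rng.standard_normal((10, 4)).round(2)]),
                   columns=['col1', 'col2', 'col3', 'col4', 'col5', 'col6'])
        .astype({'col3': float, 'col4': float, 'col5': float, 'col6': float}))

cols = ['col1', 'col2']
# Group labels: one integer code per row, uniques in order of appearance
i, r = pd.factorize([*zip(*map(df.get, cols))])
v = df.drop(columns=cols).values
n, m = shape = len(r), v.shape[1]

b = np.zeros(shape)              # per-group sums
c = np.zeros(shape)              # per-group counts
d = np.zeros(shape, np.int64)    # per-group above-mean tallies

# Flattened (group, column) index pairs for the unbuffered scatter-adds
i0, i1 = i.repeat(m), np.tile(np.arange(m), len(v))
np.add.at(b, (i0, i1), v.ravel())
np.add.at(c, (i0, i1), 1)
np.add.at(d, (i0, i1), (v > (b / c)[i]).ravel())  # b/c is the group mean

result = pd.DataFrame(d, pd.MultiIndex.from_tuples(r, names=cols),
                      df.columns.difference(cols))

# Cross-check against the straightforward groupby route
expected = df.groupby(cols).agg(lambda x: (x > x.mean()).sum())
assert (result.values == expected.values).all()
```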
Upvotes: 1