Reputation: 219
I have a dataframe like this:
df_test = pd.DataFrame({'ID1': ['A','A','A','A','A','A','B','B','B','B'],
                        'ID2': ['a','a','a','aa','aaa','aaa','b','b','bb','bb'],
                        'ID3': ['c1','c2','c3','c4','c5','c6','c7','c8','c9','c10'],
                        'condition': ['','!','','!','','','!','','','!']})
I want to group them by ID1 and ID2. The result dataframe needs to look like this (Count_condition is the number of '!' values in each ID2 group):
df_test_result = pd.DataFrame({'ID1': ['A','A','A','B','B'],
                               'ID2': ['a','aa','aaa','b','bb'],
                               'Count_ID2': [3,1,2,2,2],
                               'Count_ID3': [3,1,2,2,2],
                               'Count_condition': [1,1,0,1,1]})
I have tried to use groupby and agg to obtain this result, but I cannot get the number of '!' for each group. Here's my command:
df_test_result = df_test.groupby(['ID1','ID2']).agg({'ID2':'count','ID3':'nunique','condition':'count'})
Is there a command like this (I know this one is wrong)?
df_test = df_test.groupby(['ID1','ID2']).agg({'ID2':'count','ID3':'nunique','condition' == '!':'count'})
Upvotes: 4
Views: 3012
Reputation: 13821
You can use a named aggregation with groupby:
df_test.groupby(['ID1', 'ID2']).agg(
    Count_ID2=('ID2', 'count'),
    Count_ID3=('ID3', 'count'),
    Count_condition=('condition', lambda x: x.eq('!').sum()))
prints:
Count_ID2 Count_ID3 Count_condition
ID1 ID2
A a 3 3 1
aa 1 1 1
aaa 2 2 0
B b 2 2 1
bb 2 2 1
In the above we are counting the occurrences with aggfunc="count"
for columns "ID2" and "ID3", and creating a small custom function which counts the occurrences of !
in the "condition" column. We do the aforementioned for each group and return named columns for our aggregation results.
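To match the expected frame exactly (flat columns, ID3 counted with nunique as in the question), the same idea can be chained with reset_index. A minimal, self-contained sketch, assuming pandas 0.25+ for named aggregation:

```python
import pandas as pd

df_test = pd.DataFrame({'ID1': ['A','A','A','A','A','A','B','B','B','B'],
                        'ID2': ['a','a','a','aa','aaa','aaa','b','b','bb','bb'],
                        'ID3': ['c1','c2','c3','c4','c5','c6','c7','c8','c9','c10'],
                        'condition': ['','!','','!','','','!','','','!']})

# Named aggregation; reset_index flattens the (ID1, ID2)
# MultiIndex back into ordinary columns.
result = (
    df_test.groupby(['ID1', 'ID2'])
           .agg(Count_ID2=('ID2', 'count'),
                Count_ID3=('ID3', 'nunique'),
                Count_condition=('condition', lambda x: x.eq('!').sum()))
           .reset_index()
)
print(result)
```

The lambda receives each group's condition values as a Series, so x.eq('!').sum() counts the matches within that group.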
Upvotes: 6
Reputation: 120391
You can use count for the Count_condition column if you replace empty values with np.nan or pd.NA before the aggregation:
df_test_result = (
df_test.replace({'condition': {'': pd.NA}}).groupby(['ID1', 'ID2'])
.agg({'ID2': 'count','ID3': 'nunique','condition': 'count'})
.add_prefix('Count_').reset_index()
)
Output:
>>> df_test_result
ID1 ID2 Count_ID2 Count_ID3 Count_condition
0 A a 3 3 1
1 A aa 1 1 1
2 A aaa 2 2 0
3 B b 2 2 1
4 B bb 2 2 1
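This works because count only tallies non-missing values; empty strings are not missing, so they must be converted first. A quick sketch on a throwaway Series:

```python
import pandas as pd

s = pd.Series(['!', '', '!'])
print(s.count())                       # 3: empty strings still count
print(s.replace({'': pd.NA}).count())  # 2: only the '!' entries remain
```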
Update
Another solution is to convert the initial condition column to boolean and use sum to count the True values:
df_test_result = (
df_test.astype({'condition': bool}).groupby(['ID1', 'ID2'])
.agg({'ID2': 'count','ID3': 'nunique','condition': 'sum'})
.add_prefix('Count_').reset_index()
)
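The boolean route relies on Python string truthiness: only the empty string is falsy, so astype(bool) turns each '!' into True. A small sketch to illustrate:

```python
import pandas as pd

s = pd.Series(['', '!', '', '!'])
print(s.astype(bool).tolist())  # [False, True, False, True]
print(s.astype(bool).sum())     # 2
```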
Upvotes: 1
Reputation: 28649
@sophocles' answer suffices; this uses the same idea (named aggregation) but precomputes the indicator column before aggregating, which should hopefully be faster:
(pd.get_dummies(df_test, columns=['condition'])
   .drop(columns='condition_')
   .groupby(['ID1', 'ID2'])
   .agg(count_ID2=('ID2', 'size'),
        count_ID3=('ID3', 'size'),
        count_condition=('condition_!', 'sum'))
)
         count_ID2  count_ID3  count_condition
ID1 ID2
A a 3 3 1
aa 1 1 1
aaa 2 2 0
B b 2 2 1
bb 2 2 1
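To see what the groupby receives here, get_dummies one-hot encodes condition into a column per value ('condition_' for the empty string, 'condition_!' for '!'); dropping 'condition_' leaves a 0/1 indicator that sum can aggregate. An illustrative sketch on a tiny frame:

```python
import pandas as pd

df = pd.DataFrame({'condition': ['', '!', '']})
dummies = pd.get_dummies(df, columns=['condition'])
print(dummies.columns.tolist())  # ['condition_', 'condition_!']
print(dummies['condition_!'].tolist())
```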
Note that named aggregations are not necessarily faster; they do offer the convenience of renaming aggregated columns.
Upvotes: 1
Reputation: 260300
Variation on your initial attempt:
(df_test.groupby(['ID1','ID2'])
.agg({'ID2':'count',
'ID3':'nunique',
'condition': lambda s: s.eq('!').sum()})
.add_prefix('Count_')
.reset_index()
)
output:
ID1 ID2 Count_ID2 Count_ID3 Count_condition
0 A a 3 3 1
1 A aa 1 1 1
2 A aaa 2 2 0
3 B b 2 2 1
4 B bb 2 2 1
Upvotes: 1