Jiao

Reputation: 219

Pandas groupby and count numbers of item by conditions

I have a dataframe like this:

import pandas as pd

df_test = pd.DataFrame({'ID1':['A','A','A','A','A','A','B','B','B','B'],
                       'ID2':['a','a','a','aa','aaa','aaa','b','b','bb','bb'],
                       'ID3':['c1','c2','c3','c4','c5','c6','c7','c8','c9','c10'],
                       'condition':['','!','','!','','','!','','','!']})

df_test

I want to group them by ID1 and ID2. The result dataframe needs to look like this (Count_condition is the number of '!' values in each ID2 group):

df_test_result = pd.DataFrame({'ID1':['A','A','A','B','B'],
                       'ID2':['a','aa','aaa','b','bb'],
                        'Count_ID2':[3,1,2,2,2],
                        'Count_ID3':[3,1,2,2,2],
                        'Count_condition': [1,1,0,1,1]})

df_test_result

I have tried to use groupby and agg to obtain this result, but I cannot get the number of '!' for each group. Here's my command:

df_test_result = df_test.groupby(['ID1','ID2']).agg({'ID2':'count','ID3':'nunique','condition':'count'})

Is there a kind of command like this (invalid) one:

df_test = df_test.groupby(['ID1','ID2']).agg({'ID2':'count','ID3':'nunique','condition' == '!':'count'})
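A condition can't appear as a key in the `agg` dict like that, but one workaround is to precompute a boolean helper column and sum it (`True` counts as 1). A minimal sketch of this idea (the helper name `is_excl` is just illustrative):

```python
import pandas as pd

df_test = pd.DataFrame({'ID1': ['A','A','A','A','A','A','B','B','B','B'],
                        'ID2': ['a','a','a','aa','aaa','aaa','b','b','bb','bb'],
                        'ID3': ['c1','c2','c3','c4','c5','c6','c7','c8','c9','c10'],
                        'condition': ['','!','','!','','','!','','','!']})

# Precompute a boolean column marking rows where condition == '!',
# then aggregate it with 'sum' to count the '!' rows per group.
out = (df_test.assign(is_excl=df_test['condition'].eq('!'))
              .groupby(['ID1', 'ID2'])
              .agg(Count_ID2=('ID2', 'count'),
                   Count_ID3=('ID3', 'nunique'),
                   Count_condition=('is_excl', 'sum'))
              .reset_index())
print(out)
```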

Upvotes: 4

Views: 3012

Answers (4)

sophocles

Reputation: 13821

You can use named aggregation:

df_test.groupby(
    ['ID1','ID2']).agg(
    Count_ID2=('ID2', 'count'),
    Count_ID3=('ID3', 'count'),
    Count_condition=("condition", lambda x: str(x).count('!')))

prints:

         Count_ID2  Count_ID3  Count_condition
ID1 ID2                                       
A   a            3          3                1
    aa           1          1                1
    aaa          2          2                0
B   b            2          2                1
    bb           2          2                1

In the above we count the occurrences with aggfunc="count" for columns "ID2" and "ID3", and create a small custom function that counts the occurrences of ! in the "condition" column. We do this for each group and return named columns for the aggregation results.

Upvotes: 6

Corralien

Reputation: 120391

You can use count for Count_condition column if you replace empty values by np.nan or pd.NA before the aggregation:

df_test_result = (
    df_test.replace({'condition': {'': pd.NA}}).groupby(['ID1', 'ID2'])
           .agg({'ID2': 'count','ID3': 'nunique','condition': 'count'})
           .add_prefix('Count_').reset_index()
)

Output:

>>> df_test_result
  ID1  ID2  Count_ID2  Count_ID3  Count_condition
0   A    a          3          3                1
1   A   aa          1          1                1
2   A  aaa          2          2                0
3   B    b          2          2                1
4   B   bb          2          2                1

Update

Another solution is to convert your initial condition column to boolean and use sum to count values:

df_test_result = (
    df_test.astype({'condition': bool}).groupby(['ID1', 'ID2'])
           .agg({'ID2': 'count','ID3': 'nunique','condition': 'sum'})
           .add_prefix('Count_').reset_index()
)
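As a quick check of why the boolean conversion works: the empty string is falsy, so `astype(bool)` maps `''` to `False` and `'!'` to `True`, and `'sum'` then counts the `True` values. A small sketch:

```python
import pandas as pd

s = pd.Series(['', '!', '', '!'])
print(s.astype(bool).tolist())    # empty strings become False
print(int(s.astype(bool).sum()))  # number of non-empty entries
```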

Upvotes: 1

sammywemmy

Reputation: 28649

@sophocles' answer suffices; this uses the same idea (named aggregation), but precomputes the indicator columns before aggregation (which should hopefully be faster):

(pd.get_dummies(df_test, columns=['condition'])
   .drop(columns='condition_')
   .groupby(['ID1', 'ID2'])
   .agg(count_ID1=('ID2', 'size'), 
        count_ID2=('ID3', 'size'), 
        count_condition=('condition_!', 'sum'))
)
         count_ID1  count_ID2  count_condition
ID1 ID2
A   a            3          3                1
    aa           1          1                1
    aaa          2          2                0
B   b            2          2                1
    bb           2          2                1

Note that named aggregations are not necessarily faster; they do offer the convenience of renaming the aggregated columns.
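For reference, `get_dummies` on the `condition` column creates one indicator column per distinct value, named `<column>_<value>`: `condition_` for the empty string and `condition_!` for `'!'`; the empty-string column is then dropped in the answer above. A small sketch:

```python
import pandas as pd

d = pd.get_dummies(pd.DataFrame({'condition': ['', '!', '']}),
                   columns=['condition'])
print(d.columns.tolist())  # one indicator column per distinct value
```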

Upvotes: 1

mozway

Reputation: 260300

Variation on your initial attempt:

(df_test.groupby(['ID1','ID2'])
        .agg({'ID2':'count',
              'ID3':'nunique',
              'condition': lambda s: s.eq('!').sum()})
        .add_prefix('Count_')
        .reset_index()
)

output:

  ID1  ID2  Count_ID2  Count_ID3  Count_condition
0   A    a          3          3                1
1   A   aa          1          1                1
2   A  aaa          2          2                0
3   B    b          2          2                1
4   B   bb          2          2                1

Upvotes: 1
