glpsx

Reputation: 679

Pandas groupby with specified conditions

I'm learning Python/Pandas with a DataFrame having the following structure:

df1 = pd.DataFrame({'unique_id' : [1, 1, 2, 2, 2, 3, 3, 3, 3, 3], 
                   'brand' : ['A', 'B', 'A', 'C', 'X', 'A', 'C', 'X', 'X', 'X']})

print(df1)

   unique_id brand
0          1     A
1          1     B
2          2     A
3          2     C
4          2     X
5          3     A
6          3     C
7          3     X
8          3     X
9          3     X

My goal is to make some calculations on the above DataFrame.

Specifically, for each unique_id, I want to:

- count the rows whose brand is not 'X' (count_brands_not_x), and
- count the rows whose brand is 'X' (count_brand_x).

Visually, using the above example, the resulting DataFrame I'm looking for should look like this:

   unique_id  count_brands_not_x  count_brand_x
0          1                   2              0
1          2                   2              1
2          3                   2              3

I have used the groupby method on simple examples in the past, but I don't know how to specify conditions within a groupby to solve this new problem. Any help would be appreciated.

Upvotes: 0

Views: 90

Answers (4)

furas

Reputation: 143110

I would first create the groups and then count the elements in each group. (There may be a better function for counting items in agg(), but this works.)

import pandas as pd

df1 = pd.DataFrame({'unique_id' : [1, 1, 2, 2, 2, 3, 3, 3, 3, 3], 
                   'brand' : ['A', 'B', 'A', 'C', 'X', 'A', 'C', 'X', 'X', 'X']})

g = df1.groupby('unique_id')

df = pd.DataFrame()
df['count_brand_x'] = g['brand'].agg(lambda data: (data == 'X').sum())
df['count_brands_not_x'] = g['brand'].agg(lambda data: (data != 'X').sum())

df = df.reset_index()

print(df)

EDIT: once df['count_brand_x'] exists, the other column can be derived from the group sizes instead of a second agg():

df['count_brands_not_x'] = g['brand'].count() - df['count_brand_x']
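Since pandas 0.25, agg() also accepts named aggregation, which builds both columns in one pass instead of assigning them one by one. A minimal sketch of that variant (the helper mask column is_x is my own addition, not from the answer above):

```python
import pandas as pd

df1 = pd.DataFrame({'unique_id': [1, 1, 2, 2, 2, 3, 3, 3, 3, 3],
                    'brand': ['A', 'B', 'A', 'C', 'X', 'A', 'C', 'X', 'X', 'X']})

res = (df1.assign(is_x=df1['brand'].eq('X'))            # mark the 'X' rows
          .groupby('unique_id')['is_x']
          .agg(count_brands_not_x=lambda s: (~s).sum(), # rows where brand != 'X'
               count_brand_x='sum')                     # True counts as 1
          .reset_index())
print(res)
```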

Upvotes: 1

Andy L.

Reputation: 25269

I would use pd.crosstab on the True/False mask produced by comparing brand against the value 'X':

s = df1.brand.eq('X')
df_final = (pd.crosstab(df1.unique_id, s)
              .rename({False: 'count_brands_not_x' , True: 'count_brand_x'}, axis=1))

Out[134]:
brand      count_brands_not_x  count_brand_x
unique_id
1                           2              0
2                           2              1
3                           2              3
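To match the flat layout from the question exactly (unique_id as a regular column, no leftover 'brand' columns name), the crosstab result can be tidied further; a sketch building on the code above:

```python
import pandas as pd

df1 = pd.DataFrame({'unique_id': [1, 1, 2, 2, 2, 3, 3, 3, 3, 3],
                    'brand': ['A', 'B', 'A', 'C', 'X', 'A', 'C', 'X', 'X', 'X']})

s = df1.brand.eq('X')
df_final = (pd.crosstab(df1.unique_id, s)
              .rename({False: 'count_brands_not_x', True: 'count_brand_x'}, axis=1)
              .rename_axis(None, axis=1)   # drop the 'brand' columns name
              .reset_index())              # unique_id back to a column
print(df_final)
```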

Upvotes: 3

Erfan

Reputation: 42946

You can use GroupBy and merge:

maskx = df1['brand'].eq('X')

d1 = df1[~maskx].groupby('unique_id')['brand'].size().reset_index()
d2 = df1[maskx].groupby('unique_id')['brand'].size().reset_index()

df = d1.merge(d2, on='unique_id', how='outer', suffixes=['_not_x', '_x']).fillna(0)

   unique_id  brand_not_x  brand_x
0          1            2     0.00
1          2            2     1.00
2          3            2     3.00
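The fillna(0) step leaves brand_x as floats, because the NaN introduced by the outer merge forced the column to float. If integer counts and the column names from the question are wanted, a possible follow-up (the rename targets are my own choice, not part of the answer above):

```python
import pandas as pd

df1 = pd.DataFrame({'unique_id': [1, 1, 2, 2, 2, 3, 3, 3, 3, 3],
                    'brand': ['A', 'B', 'A', 'C', 'X', 'A', 'C', 'X', 'X', 'X']})

maskx = df1['brand'].eq('X')
d1 = df1[~maskx].groupby('unique_id')['brand'].size().reset_index()
d2 = df1[maskx].groupby('unique_id')['brand'].size().reset_index()

df = (d1.merge(d2, on='unique_id', how='outer', suffixes=['_not_x', '_x'])
        .fillna(0)
        .astype({'brand_x': int})   # NaN-induced floats back to ints
        .rename(columns={'brand_not_x': 'count_brands_not_x',
                         'brand_x': 'count_brand_x'}))
print(df)
```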

Upvotes: 4

ALollz

Reputation: 59579

You can subset the original DataFrame and use the appropriate groupby operations for each calculation. concat joins the results.

import pandas as pd

s = df1.brand.eq('X')

res = (pd.concat([df1[~s].groupby('unique_id').brand.nunique().rename('unique_not_X'),
                  df1[s].groupby('unique_id').size().rename('count_X')], 
                 axis=1)
         .fillna(0))
#           unique_not_X  count_X
#unique_id                       
#1                     2      0.0
#2                     2      1.0
#3                     2      3.0

If, instead of the number of unique brands, you just want the number of rows with brands that are not "X", then we can perform a single groupby and unstack the result.

(df1.groupby(['unique_id', df1.brand.eq('X').map({True: 'count_X', False: 'count_not_X'})])
    .size().unstack(-1).fillna(0))
#brand      count_X  count_not_X
#unique_id                      
#1              0.0          2.0
#2              1.0          2.0
#3              3.0          2.0

Upvotes: 1
