Shashidhar
Shashidhar

Reputation: 95

Group id's based on condition on another column

I have the following data frame:

Hotel_id    Month_Year      Chef_Id  Chef_is_masterchef  
2400614     May-2015        2297544     0           
2400614     June-2015       2297544     0            
2400614     July-2015       2297544     0               
2400614     August-2015     2297544     0               
2400614     September-2015  2297544     0               
2400614     October-2015    2297544     0               
2400614     November-2015   2297544     0           
2400614     December-2015   2297544     0          
2400614     January-2016    2297544     1        
2400614     February-2016   2297544     1           
2400614     March-2016      2297544     1         

3400624     May-2016        2597531     0           
3400624     June-2016       2597531     0         
3400624     July-2016       2597531     0          
3400624     August-2016     2597531     1           
3400624     September-2016  2597531     1           

2400133     February-2016   4597531     0           
2400133     March-2016      4597531     0               
2400133     April-2016      4597531     0         
2400133     May-2016        4597531     0               
2400133     June-2016       4597531     0           
2400133     July-2016       4597531     0           
2400133     August-2016     4597531     1           
2400133     September-2016  4597531     1               
2400133     October-2016    4597531     1               
2400133     November-2016   4597531     1               
2400133     December-2016   4597531     1               
2400133     January-2017    4597531     1           
2400133     February-2017   4597531     1               
2400133     March-2017      4597531     1           
2400133     April-2017      4597531     1         
2400133     May-2017        4597531     1       

I should capture the Hotel_id, where each id should have the minimum count of three 0's and 1's in the Chef_is_masterchef column and neglect other Hotel_id which doesn't meet the condition.

After performing the required operation, It looks something like this

Hotel_id    Month_Year      Chef_Id  Chef_is_masterchef  
2400614     May-2015        2297544     0           
2400614     June-2015       2297544     0            
2400614     July-2015       2297544     0               
2400614     August-2015     2297544     0               
2400614     September-2015  2297544     0               
2400614     October-2015    2297544     0               
2400614     November-2015   2297544     0           
2400614     December-2015   2297544     0          
2400614     January-2016    2297544     1        
2400614     February-2016   2297544     1           
2400614     March-2016      2297544     1         
        

2400133     February-2016   4597531     0           
2400133     March-2016      4597531     0               
2400133     April-2016      4597531     0         
2400133     May-2016        4597531     0               
2400133     June-2016       4597531     0           
2400133     July-2016       4597531     0           
2400133     August-2016     4597531     1           
2400133     September-2016  4597531     1               
2400133     October-2016    4597531     1               
2400133     November-2016   4597531     1               
2400133     December-2016   4597531     1               
2400133     January-2017    4597531     1           
2400133     February-2017   4597531     1               
2400133     March-2017      4597531     1           
2400133     April-2017      4597531     1         
2400133     May-2017        4597531     1   

So in the above data frame, one id got deleted.

Also, I have to repeat the same process to get another data frame where each Hotel_id should have the minimum count of six 0's and 1's in the Chef_is_masterchef column and neglect other Hotel_id which doesn't meet the condition.

In this case, two ids get deleted and look something like this,

Hotel_id    Month_Year      Chef_Id  Chef_is_masterchef 
2400133     February-2016   4597531     0           
2400133     March-2016      4597531     0               
2400133     April-2016      4597531     0         
2400133     May-2016        4597531     0               
2400133     June-2016       4597531     0           
2400133     July-2016       4597531     0           
2400133     August-2016     4597531     1           
2400133     September-2016  4597531     1               
2400133     October-2016    4597531     1               
2400133     November-2016   4597531     1               
2400133     December-2016   4597531     1               
2400133     January-2017    4597531     1           
2400133     February-2017   4597531     1               
2400133     March-2017      4597531     1           
2400133     April-2017      4597531     1         
2400133     May-2017        4597531     1

Please let me know the solution.

Upvotes: 1

Views: 96

Answers (1)

jezrael
jezrael

Reputation: 862581

Solutions if only 0 and 1 values in column Chef_is_masterchef:

First count values per groups by crosstab and filter if both columns has more or equal values by N by DataFrame.ge for both columns by DataFrame.all and then pass index values which match both conditions to Series.isin:

N = 3
s = pd.crosstab(df['Hotel_id'], df['Chef_is_masterchef']).ge(N).all(1)

df1 = df[df['Hotel_id'].isin(s.index[s])]

Or you can create new column for test 0 values called zeros and use GroupBy.transform with sum for count 0 and 1 values per groups, output is mask passed to boolean indexing:

N = 3
mask = (df.assign(zeros=df['Chef_is_masterchef'].eq(0))
          .groupby('Hotel_id')[['Chef_is_masterchef', 'zeros']]
          .transform('sum')
          .ge(N)
          .all(axis=1))
df1 = df[mask]

print (df1)
    Hotel_id      Month_Year  Chef_Id  Chef_is_masterchef
0    2400614        May-2015  2297544                   0
1    2400614       June-2015  2297544                   0
2    2400614       July-2015  2297544                   0
3    2400614     August-2015  2297544                   0
4    2400614  September-2015  2297544                   0
5    2400614    October-2015  2297544                   0
6    2400614   November-2015  2297544                   0
7    2400614   December-2015  2297544                   0
8    2400614    January-2016  2297544                   1
9    2400614   February-2016  2297544                   1
10   2400614      March-2016  2297544                   1
16   2400133   February-2016  4597531                   0
17   2400133      March-2016  4597531                   0
18   2400133      April-2016  4597531                   0
19   2400133        May-2016  4597531                   0
20   2400133       June-2016  4597531                   0
21   2400133       July-2016  4597531                   0
22   2400133     August-2016  4597531                   1
23   2400133  September-2016  4597531                   1
24   2400133    October-2016  4597531                   1
25   2400133   November-2016  4597531                   1
26   2400133   December-2016  4597531                   1
27   2400133    January-2017  4597531                   1
28   2400133   February-2017  4597531                   1
29   2400133      March-2017  4597531                   1
30   2400133      April-2017  4597531                   1
31   2400133        May-2017  4597531                   1

N = 6
mask = (df.assign(zeros=df['Chef_is_masterchef'].eq(0))
          .groupby('Hotel_id')[['Chef_is_masterchef', 'zeros']]
          .transform('sum')
          .ge(N)
          .all(axis=1))
df2 = df[mask]
print (df2)
    Hotel_id      Month_Year  Chef_Id  Chef_is_masterchef
16   2400133   February-2016  4597531                   0
17   2400133      March-2016  4597531                   0
18   2400133      April-2016  4597531                   0
19   2400133        May-2016  4597531                   0
20   2400133       June-2016  4597531                   0
21   2400133       July-2016  4597531                   0
22   2400133     August-2016  4597531                   1
23   2400133  September-2016  4597531                   1
24   2400133    October-2016  4597531                   1
25   2400133   November-2016  4597531                   1
26   2400133   December-2016  4597531                   1
27   2400133    January-2017  4597531                   1
28   2400133   February-2017  4597531                   1
29   2400133      March-2017  4597531                   1
30   2400133      April-2017  4597531                   1
31   2400133        May-2017  4597531                   1

If possible some another values like 0and 1 is possible use:

N = 3
mask = (df.assign(zeros= df['Chef_is_masterchef'].eq(0),
                  ones = df['Chef_is_masterchef'].eq(1))
          .groupby('Hotel_id')[['ones', 'zeros']]
          .transform('sum')
          .ge(N)
          .all(axis=1))

df1 = df[mask]

Upvotes: 1

Related Questions