Reputation: 95
I have the following data frame:
Hotel_id Month_Year Chef_Id Chef_is_masterchef
2400614 May-2015 2297544 0
2400614 June-2015 2297544 0
2400614 July-2015 2297544 0
2400614 August-2015 2297544 0
2400614 September-2015 2297544 0
2400614 October-2015 2297544 0
2400614 November-2015 2297544 0
2400614 December-2015 2297544 0
2400614 January-2016 2297544 1
2400614 February-2016 2297544 1
2400614 March-2016 2297544 1
3400624 May-2016 2597531 0
3400624 June-2016 2597531 0
3400624 July-2016 2597531 0
3400624 August-2016 2597531 1
3400624 September-2016 2597531 1
2400133 February-2016 4597531 0
2400133 March-2016 4597531 0
2400133 April-2016 4597531 0
2400133 May-2016 4597531 0
2400133 June-2016 4597531 0
2400133 July-2016 4597531 0
2400133 August-2016 4597531 1
2400133 September-2016 4597531 1
2400133 October-2016 4597531 1
2400133 November-2016 4597531 1
2400133 December-2016 4597531 1
2400133 January-2017 4597531 1
2400133 February-2017 4597531 1
2400133 March-2017 4597531 1
2400133 April-2017 4597531 1
2400133 May-2017 4597531 1
I should capture the Hotel_id, where each id should have the minimum count of three 0's and 1's in the Chef_is_masterchef column and neglect other Hotel_id which doesn't meet the condition.
After performing the required operation, It looks something like this
Hotel_id Month_Year Chef_Id Chef_is_masterchef
2400614 May-2015 2297544 0
2400614 June-2015 2297544 0
2400614 July-2015 2297544 0
2400614 August-2015 2297544 0
2400614 September-2015 2297544 0
2400614 October-2015 2297544 0
2400614 November-2015 2297544 0
2400614 December-2015 2297544 0
2400614 January-2016 2297544 1
2400614 February-2016 2297544 1
2400614 March-2016 2297544 1
2400133 February-2016 4597531 0
2400133 March-2016 4597531 0
2400133 April-2016 4597531 0
2400133 May-2016 4597531 0
2400133 June-2016 4597531 0
2400133 July-2016 4597531 0
2400133 August-2016 4597531 1
2400133 September-2016 4597531 1
2400133 October-2016 4597531 1
2400133 November-2016 4597531 1
2400133 December-2016 4597531 1
2400133 January-2017 4597531 1
2400133 February-2017 4597531 1
2400133 March-2017 4597531 1
2400133 April-2017 4597531 1
2400133 May-2017 4597531 1
So in the above data frame, one id got deleted.
Also, I have to repeat the same process to get another data frame where each Hotel_id should have the minimum count of six 0's and 1's in the Chef_is_masterchef column and neglect other Hotel_id which doesn't meet the condition.
In this case, two ids get deleted and look something like this,
Hotel_id Month_Year Chef_Id Chef_is_masterchef
2400133 February-2016 4597531 0
2400133 March-2016 4597531 0
2400133 April-2016 4597531 0
2400133 May-2016 4597531 0
2400133 June-2016 4597531 0
2400133 July-2016 4597531 0
2400133 August-2016 4597531 1
2400133 September-2016 4597531 1
2400133 October-2016 4597531 1
2400133 November-2016 4597531 1
2400133 December-2016 4597531 1
2400133 January-2017 4597531 1
2400133 February-2017 4597531 1
2400133 March-2017 4597531 1
2400133 April-2017 4597531 1
2400133 May-2017 4597531 1
Please let me know the solution.
Upvotes: 1
Views: 96
Reputation: 862581
Solutions if only 0
and 1
values in column Chef_is_masterchef
:
First count values per groups by crosstab
and filter if both columns has more or equal values by N
by DataFrame.ge
for both columns by DataFrame.all
and then pass index values which match both conditions to Series.isin
:
N = 3
s = pd.crosstab(df['Hotel_id'], df['Chef_is_masterchef']).ge(N).all(1)
df1 = df[df['Hotel_id'].isin(s.index[s])]
Or you can create new column for test 0
values called zeros
and use GroupBy.transform
with sum
for count 0
and 1
values per groups, output is mask
passed to boolean indexing
:
N = 3
mask = (df.assign(zeros=df['Chef_is_masterchef'].eq(0))
.groupby('Hotel_id')[['Chef_is_masterchef', 'zeros']]
.transform('sum')
.ge(N)
.all(axis=1))
df1 = df[mask]
print (df1)
Hotel_id Month_Year Chef_Id Chef_is_masterchef
0 2400614 May-2015 2297544 0
1 2400614 June-2015 2297544 0
2 2400614 July-2015 2297544 0
3 2400614 August-2015 2297544 0
4 2400614 September-2015 2297544 0
5 2400614 October-2015 2297544 0
6 2400614 November-2015 2297544 0
7 2400614 December-2015 2297544 0
8 2400614 January-2016 2297544 1
9 2400614 February-2016 2297544 1
10 2400614 March-2016 2297544 1
16 2400133 February-2016 4597531 0
17 2400133 March-2016 4597531 0
18 2400133 April-2016 4597531 0
19 2400133 May-2016 4597531 0
20 2400133 June-2016 4597531 0
21 2400133 July-2016 4597531 0
22 2400133 August-2016 4597531 1
23 2400133 September-2016 4597531 1
24 2400133 October-2016 4597531 1
25 2400133 November-2016 4597531 1
26 2400133 December-2016 4597531 1
27 2400133 January-2017 4597531 1
28 2400133 February-2017 4597531 1
29 2400133 March-2017 4597531 1
30 2400133 April-2017 4597531 1
31 2400133 May-2017 4597531 1
N = 6
mask = (df.assign(zeros=df['Chef_is_masterchef'].eq(0))
.groupby('Hotel_id')[['Chef_is_masterchef', 'zeros']]
.transform('sum')
.ge(N)
.all(axis=1))
df2 = df[mask]
print (df2)
Hotel_id Month_Year Chef_Id Chef_is_masterchef
16 2400133 February-2016 4597531 0
17 2400133 March-2016 4597531 0
18 2400133 April-2016 4597531 0
19 2400133 May-2016 4597531 0
20 2400133 June-2016 4597531 0
21 2400133 July-2016 4597531 0
22 2400133 August-2016 4597531 1
23 2400133 September-2016 4597531 1
24 2400133 October-2016 4597531 1
25 2400133 November-2016 4597531 1
26 2400133 December-2016 4597531 1
27 2400133 January-2017 4597531 1
28 2400133 February-2017 4597531 1
29 2400133 March-2017 4597531 1
30 2400133 April-2017 4597531 1
31 2400133 May-2017 4597531 1
If possible some another values like 0
and 1
is possible use:
N = 3
mask = (df.assign(zeros= df['Chef_is_masterchef'].eq(0),
ones = df['Chef_is_masterchef'].eq(1))
.groupby('Hotel_id')[['ones', 'zeros']]
.transform('sum')
.ge(N)
.all(axis=1))
df1 = df[mask]
Upvotes: 1