Reputation: 271
import pandas as pd

df = pd.DataFrame({"name": ["A", "A", "B", "B", "C", "C"],
                   "nickname": ["X", "Y", "X", "Z", "Y", "Y"]})
How can I group df by "name" and drop those groups that contain only 'Y'? In my case, 'C' should be dropped.
I am using the code below, but it is not working:
df_new = df.groupby('name').filter(lambda x: all(x['nickname'] != 'Y'))
If 'Y' occurs under a 'name' together with some other nickname, then that name should be retained. Kindly help.
Upvotes: 1
Views: 140
Reputation: 862521
Here groupby is not necessary. You can use boolean indexing:
df = df[df['name'].isin(df.loc[df['nickname'].ne('Y'), 'name'].unique())]
print (df)
name nickname
0 A X
1 A Y
2 B X
3 B Z
Explanation:
First, compare with ne to flag values not equal to 'Y':
print (df['nickname'].ne('Y'))
0 True
1 False
2 True
3 True
4 False
5 False
Name: nickname, dtype: bool
Then select the name column by the boolean mask:
print (df.loc[df['nickname'].ne('Y'), 'name'])
0 A
2 B
3 B
Name: name, dtype: object
For better performance get unique values:
print(df.loc[df['nickname'].ne('Y'), 'name'].unique())
['A' 'B']
Finally, filter with isin to build the final mask:
print (df['name'].isin(df.loc[df['nickname'].ne('Y'), 'name'].unique()))
0 True
1 True
2 True
3 True
4 False
5 False
Name: name, dtype: bool
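Putting the steps above together, the boolean-indexing approach can be wrapped in a small helper. This is just a sketch; the function name and parameters are my own, for illustration:

```python
import pandas as pd

def drop_all_y_groups(df, group_col="name", value_col="nickname", drop_value="Y"):
    # Keep only the group keys that appear at least once with a value
    # other than drop_value, then filter the frame down to those keys.
    keep = df.loc[df[value_col].ne(drop_value), group_col].unique()
    return df[df[group_col].isin(keep)]

df = pd.DataFrame({"name": ["A", "A", "B", "B", "C", "C"],
                   "nickname": ["X", "Y", "X", "Z", "Y", "Y"]})
print(drop_all_y_groups(df))  # rows for A and B; all C rows dropped
```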
Performance:
This depends on the number of rows, the number of unique groups, and the number of matched values, so it is best to test on your real data:
np.random.seed(123)
N = 100000
df = pd.DataFrame({'name': np.random.randint(1000,size=N).astype(str),
'nickname':np.random.randint(200,size=N).astype(str)})
#print (df)
In [152]: %timeit df[df.nickname.ne('Y').groupby(df.name).transform('sum').astype(bool)]
27.6 ms ± 292 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [153]: %timeit df[~df.nickname.eq('Y').groupby(df.name).transform('all')]
27.3 ms ± 162 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [154]: %timeit df[df['name'].isin(df.loc[df['nickname'].ne('Y'), 'name'].unique())]
28.9 ms ± 189 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [155]: %timeit df[~df.assign(mask=df.nickname.eq('Y')).groupby('name').mask.transform('all')]
30.3 ms ± 469 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [156]: %timeit df[df.groupby('name')['nickname'].transform('unique').astype(str) !="['Y']"]
15.6 s ± 233 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [157]: %timeit df.groupby('name').filter(lambda x: any(x['nickname'] != 'Y'))
408 ms ± 29.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Upvotes: 2
Reputation: 402353
You probably want groupby and transform, or some derivative of the same thing.
df[~df.nickname.eq('Y').groupby(df.name).transform('all')]
# Or,
# df[~df.assign(mask=df.nickname.eq('Y')).groupby('name').mask.transform('all')]
name nickname
0 A X
1 A Y
2 B X
3 B Z
An even faster groupby-related solution involves... counting!
df[df.nickname.ne('Y').groupby(df.name).transform('sum').astype(bool)]
name nickname
0 A X
1 A Y
2 B X
3 B Z
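To see why the ~transform('all') mask works, it can help to print the intermediate Series; a quick sketch on the sample data:

```python
import pandas as pd

df = pd.DataFrame({"name": ["A", "A", "B", "B", "C", "C"],
                   "nickname": ["X", "Y", "X", "Z", "Y", "Y"]})

# Per-row flag: is this nickname 'Y'?
is_y = df.nickname.eq("Y")
# 'all' within each name: True only for groups made up entirely of 'Y'.
all_y = is_y.groupby(df.name).transform("all")
print(all_y.tolist())  # [False, False, False, False, True, True]
print(df[~all_y])      # negating the mask keeps A and B, drops C
```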
Upvotes: 1
Reputation: 12417
I think you need any instead of all in your solution:
df_new = df.groupby('name').filter(lambda x: any(x['nickname'] != 'Y'))
Output:
name nickname
0 A X
1 A Y
2 B X
3 B Z
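A quick side-by-side of the two predicates on the sample data shows why all drops too much; the variable names here are illustrative:

```python
import pandas as pd

df = pd.DataFrame({"name": ["A", "A", "B", "B", "C", "C"],
                   "nickname": ["X", "Y", "X", "Z", "Y", "Y"]})

# all(...) keeps only groups with NO 'Y' at all -- it drops A as well,
# which is why the original attempt returned too few rows.
no_y_at_all = df.groupby("name").filter(lambda g: all(g["nickname"] != "Y"))
# any(...) keeps groups with at least one non-'Y' nickname -- only C goes.
some_non_y = df.groupby("name").filter(lambda g: any(g["nickname"] != "Y"))

print(sorted(no_y_at_all["name"].unique()))  # ['B']
print(sorted(some_non_y["name"].unique()))   # ['A', 'B']
```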
Upvotes: 0
Reputation: 11192
Simply use this:
temp = df.groupby('name')['nickname'].transform('unique').astype(str)
df = df[temp != "['Y']"]
print(df)
Output:
name nickname
0 A X
1 A Y
2 B X
3 B Z
Upvotes: 0