Reputation: 1961
I have a dataframe with the structure
name code
Big J 323
Big J 323
Big J 323
Big J Online 323
Big J Online 323
Big J Online 323
Code Base 476
Code Base 476
Sometimes there are duplicates that contain online. In this situation, I want to keep all of the online versions and remove the offline versions. If there is no online version, I want to keep the offline version.
The result would be
Big J Online 323
Big J Online 323
Big J Online 323
Code Base 476
Code Base 476
Is there an efficient way to filter this out?
Upvotes: 1
Views: 43
Reputation: 862511
you can filter rows with Online
, remove it by replace
and filter out this rows in Series.isin
(solution not working per groups):
vals = df.loc[df['name'].str.contains(' Online'), 'name'].str.replace(' Online','')
df = df[~df['name'].isin(vals)]
print (df)
name code
3 Big J Online 323
4 Big J Online 323
5 Big J Online 323
6 Code Base 476
7 Code Base 476
Same solution per groups:
m = df['name'].str.contains('Online')
f = lambda x: x['name'].isin(x['new'])
df = df[~df.assign(new = df.loc[m, 'name'].str.replace(' Online',''))
.groupby('code', group_keys=False)
.apply(f)]
print (df)
name code
3 Big J Online 323
4 Big J Online 323
5 Big J Online 323
6 Code Base 476
7 Code Base 476
Upvotes: 2
Reputation: 28644
One possible way to solve it is to create a temporary column that checks if 'online' exists in any of the values in name
per code
, then filter out those rows:
(df.assign(online = df['name'].str.contains('Online'),
checker = lambda df: df.groupby('code').online.transform('sum'))
.loc[lambda df: ~((df.online.eq(0)) & (df.checker.gt(0))), df.columns]
)
name code
3 Big J Online 323
4 Big J Online 323
5 Big J Online 323
6 Code Base 476
7 Code Base 476
Upvotes: 0