Reputation: 1455
I have DataFrame:
df = pd.read_csv(...)
a b c d e f
1 two adc aaaa Nan mmm
2 one Nan aaa Nan nnn
1 one ab Nan Nan ww
1 two abcd aaa ff uiww
1 two a aaa d iii
I want to drop duplicates according to 'a' and 'b'.
df = df.drop_duplicates(['a', 'b'])
I want to get this result:
a b c d e f
1 two abcd aaaa ff uiww
2 one Nan aaa Nan nnn
1 one ab Nan Nan ww
I tried to use transform
and apply
, but it was not possible to reduce to one scheme. What would be the most efficient way to achieve this?
Upvotes: 2
Views: 345
Reputation: 116
Unless you need to use groupby (which is slow with large data frames) you can do the following:
def custom_drop_duplicates(dataframe):
localDF = dataframe.copy()
criteria_list = []
for i, col in enumerate(['c', 'd', 'f']):
localDF.loc[:, 'criteria{}'.format(i)] = [len(x) for x in localDF[col]]
criteria_list.append('criteria{}'.format(i))
localDF.loc[:, 'criteria{}'.format(i+1)] = [all(x not in y for x in ['m', 'n']) or any(x in y for x in ['w', 'y']) for y in localDF['f']]
criteria_list.append('criteria{}'.format(i+1))
# here you have a judgement call: if criteria are in conflict, you need to order them. I just assume they are ordered in the way you described them.
localDF.sort_values(by=criteria_list, ascending=True, inplace=True)
localDF.drop_duplicates(subset=['a', 'b'], keep='last', inplace=True)
localDF.drop(columns=criteria_list, inplace=True)
return localDF
Hope this helps
Upvotes: 1
Reputation: 323226
Based on your conditions create the function , then using agg
with groupby
def yourfunc1(x):
return x.loc[x.str.len().idxmax()]
def yourfunc2(x):
if any(x.str.contains('w|y')|(~x.str.contains('m|n'))):
return x.loc[x.str.contains('w|y')|(~x.str.contains('m|n'))].iloc[0]
else :
return x.iloc[0]
df=df.replace({'Nan':''})
s=df.groupby(['a','b'],as_index=False).agg({'c':yourfunc1,'d':yourfunc1,'e':yourfunc1,'f':yourfunc2})
a b c d e f
0 1 one ab ww
1 1 two abcd aaaa ff uiww
2 2 one aaa nnn
Upvotes: 5