Reputation: 425
I'm new to pandas and Python, and I want to remove duplicates but with a priority. It's hard to explain, so here is an example to make it clear:
ID Phone Email
0001 0234+ null
0001 null [email protected]
0001 0234+ [email protected]
How can I remove the duplicate IDs but keep the third row, because it has both a phone and an email, instead of dropping rows arbitrarily? If an ID has no row with both values, one of its rows should still remain.
Upvotes: 0
Views: 254
Reputation: 425
I solved this by putting each case into its own data frame: for example, rows where both email and phone have a value go into a first data frame, rows where only the email has a value go into a second one, etc.
Then I concatenate them with the most important case at the top and drop the duplicate IDs, so the first (highest-priority) row for each ID is the one that is kept.
code:
import pandas as pd

# ff is the input DataFrame, loaded earlier
# column names: "الجوال" = phone, "البريد الالكتروني" = email, "رقم الهوية" = ID number
# drop rows where both phone and email are null
ff = ff.dropna(subset=["الجوال", "البريد الالكتروني"], how="all")
# hh = rows where both phone and email are present
hh = ff.dropna(subset=["الجوال", "البريد الالكتروني"])
# ss = rows where phone is present (email may be missing)
ss = ff.dropna(subset=["الجوال"])
# yy = rows where email is present (phone may be missing)
yy = ff.dropna(subset=["البريد الالكتروني"])
# give priority by putting the most important rows on top
df2 = pd.concat([hh, ss, yy], axis=0)
# drop_duplicates keeps the first occurrence per ID, i.e. the highest-priority row
final = df2.drop_duplicates(subset=["رقم الهوية"])
final.to_excel(r'Result.xlsx', index=False)
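The same prioritisation can probably be done without building separate data frames, by ranking each row on how many contact fields it has filled in. The following is only a minimal sketch of that alternative, not the code above: the sample frame, the English column names `ID`, `Phone`, `Email` and the placeholder e-mail addresses are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample data modelled on the question; e-mail addresses are placeholders.
df = pd.DataFrame({
    "ID": ["0001", "0001", "0001", "0002", "0002"],
    "Phone": ["0234", None, "0234", None, None],
    "Email": [None, "a@example.com", "a@example.com", "b@example.com", None],
})

# Count the non-null contact fields in each row (0, 1 or 2).
filled = df[["Phone", "Email"]].notna().sum(axis=1)

# Row labels ordered from most complete to least complete.
order = filled.sort_values(ascending=False, kind="mergesort").index

# Keep the first (most complete) row per ID, then restore the original row order.
result = df.loc[order].drop_duplicates(subset="ID", keep="first").sort_index()
print(result)
```

Unlike the code above, this sketch does not prefer phone-only rows over email-only rows, and it keeps an ID whose only row has both fields empty; add a `dropna(..., how="all")` first if those rows should be removed.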
Upvotes: 0
Reputation: 6819
You can just drop the NaN values based on Phone and Email.
df.dropna(subset=['Phone', 'Email'], inplace=True)
Upvotes: 0
Reputation: 169
First drop the rows with NaNs, then drop the duplicates:
df2 = df.dropna(subset=['Phone']).dropna(subset=['Email']).drop_duplicates('ID')
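To see what this chain does, here is a small sketch on sample data shaped like the question's. The frame contents, the extra ID `0002` and the placeholder e-mail address are assumptions for illustration. Note that an ID with no complete row disappears entirely, which may or may not be what is wanted.

```python
import pandas as pd

# Hypothetical sample data; ID 0002 has a phone but no email.
df = pd.DataFrame({
    "ID": ["0001", "0001", "0001", "0002"],
    "Phone": ["0234", None, "0234", "0555"],
    "Email": [None, "a@example.com", "a@example.com", None],
})

# Drop rows missing a phone, then rows missing an email, then duplicate IDs.
df2 = df.dropna(subset=["Phone"]).dropna(subset=["Email"]).drop_duplicates("ID")
print(df2)
```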
Upvotes: 1