Reputation: 425

How to drop duplicates with priority in pandas

I'm new to pandas and Python, and I want to remove duplicates according to a priority. It's hard to explain, so here is an example:

ID      Phone   Email
0001    0234+   null
0001    null    a@.com
0001    0234+   a@.com

How can I remove the duplicates in ID but keep the third row, because it has both a phone and an email, rather than keeping an arbitrary one? If an ID has no row with both values filled in, one row for that ID should still remain.

Upvotes: 0

Answers (3)

Drsaud

Reputation: 425

I solved this by putting each case into its own data frame: rows where both email and phone have a value go into the first data frame, rows that have a phone go into the second, rows that have an email go into the third, and so on.

Then I concatenate them into a new data frame, with the most important case at the top, and drop the duplicate IDs. Because drop_duplicates keeps the first occurrence, the highest-priority row for each ID survives.

Code:

import pandas as pd

# ff is the input DataFrame, loaded earlier.
# Columns: "الجوال" = phone, "البريد الالكتروني" = email, "رقم الهوية" = ID number.

# Drop rows where both phone and email are null
ff = ff.dropna(subset=["الجوال", "البريد الالكتروني"], how="all")

# hh: rows where both phone and email are present
hh = ff.dropna(subset=["الجوال", "البريد الالكتروني"])

# ss: rows where phone is present (email may or may not be)
ss = ff.dropna(subset=["الجوال"])

# yy: rows where email is present (phone may or may not be)
yy = ff.dropna(subset=["البريد الالكتروني"])

# Stack the groups with the most important case on top;
# drop_duplicates keeps the first row it sees for each ID
final = pd.concat([hh, ss, yy], axis=0)
final = final.drop_duplicates(subset=["رقم الهوية"])

final.to_excel("Result.xlsx", index=False)
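
The same priority idea can also be sketched without building separate data frames, by scoring each row on how many contact fields are filled in and keeping the best-scoring row per ID. This is only an illustration: the English column names (ID, Phone, Email), the sample rows, and the helper column `_filled` are assumptions, not part of the original data. Unlike the code above, this version treats phone-only and email-only rows as equal priority and keeps an ID even when all of its rows are empty.

```python
import pandas as pd

# Hypothetical sample data modelled on the question
df = pd.DataFrame({
    "ID": ["0001", "0001", "0001", "0002", "0002"],
    "Phone": ["0234+", None, "0234+", None, None],
    "Email": [None, "a@.com", "a@.com", "b@.com", None],
})

# Score each row by the number of non-null contact fields (0, 1 or 2)
filled = df[["Phone", "Email"]].notna().sum(axis=1)

result = (
    df.assign(_filled=filled)
      # Most complete rows first; mergesort is a stable sort
      .sort_values("_filled", ascending=False, kind="mergesort")
      # keep="first" now keeps the most complete row of each ID
      .drop_duplicates(subset="ID", keep="first")
      .drop(columns="_filled")
      .sort_index()
)

print(result)
```

For ID 0001 the row with both values is kept; for ID 0002, which has no complete row, the email-only row is kept instead of the empty one.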

Upvotes: 0

yudhiesh

Reputation: 6819

You can just drop the NaN values based on Phone and Email.

df.dropna(subset=['Phone', 'Email'], inplace=True)
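
To see what this call does, here is a small sketch on made-up data shaped like the question's example (the second ID, 0002, is an added assumption). With the default how="any", dropna removes every row that is missing either Phone or Email, so an ID without a single complete row disappears from the result.

```python
import pandas as pd

# Hypothetical sample data; ID 0002 has no row with both values
df = pd.DataFrame({
    "ID": ["0001", "0001", "0001", "0002", "0002"],
    "Phone": ["0234+", None, "0234+", None, None],
    "Email": [None, "a@.com", "a@.com", "b@.com", None],
})

# Default how="any": a row is dropped if Phone OR Email is missing
complete = df.dropna(subset=["Phone", "Email"])

# IDs that had rows in df but have none left after dropna
lost_ids = set(df["ID"]) - set(complete["ID"])

print(complete)
print("IDs with no row left:", lost_ids)
```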

Upvotes: 0

arp5

Reputation: 169

First drop the rows that contain NaNs, then drop the duplicates:

df2 = df.dropna(subset=['Phone']).dropna(subset=['Email']).drop_duplicates('ID')
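
If one row should still remain for IDs that have no complete row, as the question asks, the chain above could be extended along these lines. This is a sketch under assumptions: the sample data and the names `complete` and `leftover` are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data; ID 0002 has no row with both values
df = pd.DataFrame({
    "ID": ["0001", "0001", "0001", "0002", "0002"],
    "Phone": ["0234+", None, "0234+", None, None],
    "Email": [None, "a@.com", "a@.com", "b@.com", None],
})

# One complete row per ID, where a complete row exists
complete = df.dropna(subset=["Phone", "Email"]).drop_duplicates("ID")

# For IDs with no complete row, fall back to their first row
leftover = df[~df["ID"].isin(complete["ID"])].drop_duplicates("ID")

df2 = pd.concat([complete, leftover]).sort_index()

print(df2)
```

The fallback simply takes the first row of each remaining ID; a finer priority among incomplete rows would need an extra ordering step.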

Upvotes: 1
