Reputation: 63
I have a dataframe with 3 columns. I would like to drop duplicates in column A based on the values in the other columns. I have searched tirelessly and can't find a solution like this.
example:
A | B | C |
---|---|---|
Family1 | nan | nan |
Family1 | nan | 1234 |
Family1 | 1245 | nan |
Family1 | 3456 | 78787 |
Family2 | nan | nan |
Family3 | nan | nan |
Basically, I want to drop a duplicate ONLY IF the other two columns are both NaN; otherwise, the duplicate can stay.
desired output:
A | B | C |
---|---|---|
Family1 | nan | 1234 |
Family1 | 1245 | nan |
Family1 | 3456 | 78787 |
Family2 | nan | nan |
Family3 | nan | nan |
Family2 and Family3 remain in the df because they don't have duplicates, even though both columns are NaN.
Upvotes: 1
Views: 71
Reputation: 23099
Try a double boolean mask: one condition is True for all duplicates in column A, and the other is True for rows where every column after 'A' is null. If both conditions are met, we exclude that row using the ~ operator, which inverts a boolean.
df[~(df.duplicated(subset=['A'], keep=False) & df.iloc[:, 1:].isna().all(axis=1))]
A B C
1 Family1 NaN 1234
2 Family1 1245 NaN
3 Family1 3456 78787
4 Family2 NaN NaN
5 Family3 NaN NaN
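A minimal, self-contained sketch of the approach above, reconstructing the example dataframe from the question (assuming B and C are numeric columns that use NaN for missing values):

```python
import pandas as pd
import numpy as np

# Rebuild the example dataframe from the question.
df = pd.DataFrame({
    "A": ["Family1", "Family1", "Family1", "Family1", "Family2", "Family3"],
    "B": [np.nan, np.nan, 1245, 3456, np.nan, np.nan],
    "C": [np.nan, 1234, np.nan, 78787, np.nan, np.nan],
})

# Condition 1: row's value in A appears more than once (keep=False
# marks every occurrence, not just the later ones).
# Condition 2: every column after A is NaN in this row.
mask = df.duplicated(subset=["A"], keep=False) & df.iloc[:, 1:].isna().all(axis=1)

# Drop only the rows where BOTH conditions hold.
result = df[~mask]
print(result)
```

Only the first Family1 row (a duplicate with B and C both NaN) is removed; Family2 and Family3 survive because, although all-NaN, they are not duplicated in A.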
Upvotes: 3
Reputation: 26686
You were not very clear. I suspect you want to drop duplicates in column A only when both columns B and C are NaN. If so, please try:
df[~(df.A.duplicated(keep=False) & (df.B.isna() & df.C.isna()))]
Upvotes: 3