Reputation: 129
I have a dataframe, for example:
     A   B    C  D    E   F  G
0    9  34    1  1  NaN   9  3
1  NaN  34    0  9  NaN   0  2
2    0   8  NaN  3    9  11  0
3    0   8   15  3    9  11  0
4  NaN   6    1  3  NaN   2  3
5    1   6    1  3   44   2  3
I want to check which rows have duplicated values in columns B, D and G. As you can see, rows 2 and 3 have duplicated values in those columns, and so do rows 4 and 5. I have to drop the duplicates, but within each pair I want to drop the row that contains NaN values (rows 2 and 4).
I've tried df.drop_duplicates(), but its keep parameter only offers
keep='first',
keep='last' or
keep=False,
and none of these solves my problem.
My expected output is
     A   B   C  D    E   F  G
0    9  34   1  1  NaN   9  3
1  NaN  34   0  9  NaN   0  2
3    0   8  15  3    9  11  0
5    1   6   1  3   44   2  3
How can I do it? Thanks
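For anyone who wants to run the answers below, the sample frame can be rebuilt with np.nan for the missing cells (column order and values taken from the table above):

```python
import numpy as np
import pandas as pd

# Reconstruction of the sample frame from the question
df = pd.DataFrame(
    {"A": [9, np.nan, 0, 0, np.nan, 1], "B": [34, 34, 8, 8, 6, 6],
     "C": [1, 0, np.nan, 15, 1, 1], "D": [1, 9, 3, 3, 3, 3],
     "E": [np.nan, np.nan, 9, 9, np.nan, 44], "F": [9, 0, 11, 11, 2, 2],
     "G": [3, 2, 0, 0, 3, 3]})

print(df.shape)  # (6, 7)
```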
Upvotes: 2
Views: 347
Reputation: 59274
Combine duplicated and isnull with & for a column-independent way of achieving this:
>>> to_drop = df.dropna(axis=1).duplicated(keep=False) & df.isnull().any(axis=1)
>>> df.loc[~to_drop]
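Applied to the sample data (written out self-contained, and with axis passed as a keyword since positional axis arguments are deprecated in newer pandas), this keeps exactly rows 0, 1, 3 and 5:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"A": [9, np.nan, 0, 0, np.nan, 1], "B": [34, 34, 8, 8, 6, 6],
     "C": [1, 0, np.nan, 15, 1, 1], "D": [1, 9, 3, 3, 3, 3],
     "E": [np.nan, np.nan, 9, 9, np.nan, 44], "F": [9, 0, 11, 11, 2, 2],
     "G": [3, 2, 0, 0, 3, 3]})

# duplicated on the fully non-null columns AND containing at least one NaN
to_drop = df.dropna(axis=1).duplicated(keep=False) & df.isnull().any(axis=1)
result = df.loc[~to_drop]
print(result.index.tolist())  # -> [0, 1, 3, 5]
```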
Upvotes: 0
Reputation: 8816
A simple drop_duplicates() with the subset parameter gets it done. With
keep='last', all duplicates are dropped except for the last occurrence.
>>> df.drop_duplicates(subset=['B', 'D', 'G'], keep="last")
A B C D E F G
0 9.0 34 1.0 1 NaN 9 3
1 NaN 34 0.0 9 NaN 0 2
3 0.0 8 15.0 3 9.0 11 0
5 1.0 6 1.0 3 44.0 2 3
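Note that keep='last' matches the expected output here only because in this particular data each NaN row happens to appear before its complete duplicate; a quick self-contained check:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"A": [9, np.nan, 0, 0, np.nan, 1], "B": [34, 34, 8, 8, 6, 6],
     "C": [1, 0, np.nan, 15, 1, 1], "D": [1, 9, 3, 3, 3, 3],
     "E": [np.nan, np.nan, 9, 9, np.nan, 44], "F": [9, 0, 11, 11, 2, 2],
     "G": [3, 2, 0, 0, 3, 3]})

# keeps the last row of each (B, D, G) group; here those are the non-NaN rows
result = df.drop_duplicates(subset=['B', 'D', 'G'], keep="last")
print(result.index.tolist())  # -> [0, 1, 3, 5]
```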
Upvotes: 0
Reputation: 38415
You can use groupby with first, which returns the first non-null value in each group:
df.groupby(['B', 'D', 'G'], as_index = False, sort = False).first().reindex(columns = df.columns)
A B C D E F G
0 9.0 34 1.0 1 NaN 9 3
1 NaN 34 0.0 9 NaN 0 2
2 0.0 8 15.0 3 9.0 11 0
3 1.0 6 1.0 3 44.0 2 3
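Runnable on the sample data (note that with as_index=False the result gets a fresh 0..3 index, as shown above). Because first skips NaN per column, C for the (8, 3, 0) group comes out as 15.0 even though the group's first row has NaN there:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"A": [9, np.nan, 0, 0, np.nan, 1], "B": [34, 34, 8, 8, 6, 6],
     "C": [1, 0, np.nan, 15, 1, 1], "D": [1, 9, 3, 3, 3, 3],
     "E": [np.nan, np.nan, 9, 9, np.nan, 44], "F": [9, 0, 11, 11, 2, 2],
     "G": [3, 2, 0, 0, 3, 3]})

# first non-null value per column within each (B, D, G) group,
# groups kept in order of first appearance via sort=False
result = (df.groupby(['B', 'D', 'G'], as_index=False, sort=False)
            .first()
            .reindex(columns=df.columns))
print(result['C'].tolist())  # -> [1.0, 0.0, 15.0, 1.0]
```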
Upvotes: 1
Reputation: 14103
You can also use boolean indexing:
new = df[df[list('BDG')].duplicated(keep=False)]
df.iloc[~df.index.isin(new[new.isnull().any(1)].index)]
A B C D E F G
0 9 34 1 1 NaN 9 3
1 NaN 34 0 9 NaN 0 2
3 0 8 15 3 9 11 0
5 1 6 1 3 44 2 3
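The same mask also works with .loc, which takes boolean arrays directly; a self-contained sketch on the sample data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"A": [9, np.nan, 0, 0, np.nan, 1], "B": [34, 34, 8, 8, 6, 6],
     "C": [1, 0, np.nan, 15, 1, 1], "D": [1, 9, 3, 3, 3, 3],
     "E": [np.nan, np.nan, 9, 9, np.nan, 44], "F": [9, 0, 11, 11, 2, 2],
     "G": [3, 2, 0, 0, 3, 3]})

# rows duplicated on B, D, G
dup = df[list('BDG')].duplicated(keep=False)
# among those, the ones that also contain NaN are dropped
to_drop = df[dup & df.isnull().any(axis=1)].index
result = df.loc[~df.index.isin(to_drop)]
print(result.index.tolist())  # -> [0, 1, 3, 5]
```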
Upvotes: 0
Reputation: 164713
You can count nulls by row, sort by this number, then use drop_duplicates:
df['null_count'] = df.isnull().sum(axis=1)
df = df.sort_values('null_count')\
.drop_duplicates(['B', 'D', 'G'])\
.sort_index()
print(df)
A B C D E F G null_count
0 9.0 34.0 1.0 1.0 NaN 9.0 3.0 1
1 NaN 34.0 0.0 9.0 NaN 0.0 2.0 2
3 0.0 8.0 15.0 3.0 9.0 11.0 0.0 0
5 1.0 6.0 1.0 3.0 44.0 2.0 3.0 0
An alternative if you wish to avoid a helper column:
df = df.iloc[df.isnull().sum(axis=1).values.argsort()]\
.drop_duplicates(['B', 'D', 'G'])\
.sort_index()
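The helper-free variant, spelled out on the sample data: rows are reordered so that rows with fewer NaNs come first, drop_duplicates then keeps the first (cleanest) row of each (B, D, G) group, and sort_index restores the original order:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"A": [9, np.nan, 0, 0, np.nan, 1], "B": [34, 34, 8, 8, 6, 6],
     "C": [1, 0, np.nan, 15, 1, 1], "D": [1, 9, 3, 3, 3, 3],
     "E": [np.nan, np.nan, 9, 9, np.nan, 44], "F": [9, 0, 11, 11, 2, 2],
     "G": [3, 2, 0, 0, 3, 3]})

# positional order of rows, fewest NaNs first
order = df.isnull().sum(axis=1).values.argsort()
result = (df.iloc[order]
            .drop_duplicates(['B', 'D', 'G'])
            .sort_index())
print(result.index.tolist())  # -> [0, 1, 3, 5]
```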
Upvotes: 1
Reputation: 14226
I am not a pandas expert, but here is one way to achieve this:
subset = df[['B', 'D', 'G']]
to_keep = df[~subset.duplicated(keep=False)]
A B C D E F G
0 9 34 1 1 NaN 9 3
1 NaN 34 0 9 NaN 0 2
to_keep.append(df[subset.duplicated(keep=False)].dropna(), ignore_index=True)
A B C D E F G
0 9 34 1 1 NaN 9 3
1 NaN 34 0 9 NaN 0 2
2 0 8 15 3 9 11 0
3 1 6 1 3 44 2 3
If you want to leave the index as is, drop the ignore_index flag.
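For readers on newer pandas: DataFrame.append was deprecated and removed in pandas 2.0, so the same two-step result can be built with pd.concat instead:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"A": [9, np.nan, 0, 0, np.nan, 1], "B": [34, 34, 8, 8, 6, 6],
     "C": [1, 0, np.nan, 15, 1, 1], "D": [1, 9, 3, 3, 3, 3],
     "E": [np.nan, np.nan, 9, 9, np.nan, 44], "F": [9, 0, 11, 11, 2, 2],
     "G": [3, 2, 0, 0, 3, 3]})

subset = df[['B', 'D', 'G']]
# rows with no duplicate on B, D, G are kept unconditionally
to_keep = df[~subset.duplicated(keep=False)]
# from the duplicated rows, keep only the ones without any NaN
result = pd.concat(
    [to_keep, df[subset.duplicated(keep=False)].dropna()],
    ignore_index=True)
print(result['B'].tolist())  # -> [34, 34, 8, 6]
```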
Upvotes: 2