Reputation: 365
I have a dataFrame with more than 200 features, and I put a part of the dataset to show the problem:
index ID X1 X2 Date1 Y1
0 2 324 634 2016-01-01 NaN
1 2 324 634 2016-01-01 1224.0
3 4 543 843 2017-02-01 654
4 4 543 843 2017-02-01 NaN
5 5 523 843 2015-09-01 NaN
6 5 523 843 2015-09-01 1121.0
7 6 500 897 2015-11-01 NaN
As you can see the rows are duplicated (in ID, X1, X2 and Date1) and I want to remove one of the rows which are similar in ID, X1, X2, Date1 and Y1 which contains NaN. So, my desired DataFrame should be:
index ID X1 X2 Date1 Y1
1 2 324 634 2016-01-01 1224.0
3 4 543 843 2017-02-01 654
6 5 523 843 2015-09-01 1121.0
7 6 500 897 2015-11-01 NaN
Does any one know, how I can handle it?
Upvotes: 3
Views: 205
Reputation: 402613
Use sort_values
on "Y1" to move NaNs to the bottom of the DataFrame, and then use drop_duplicates
:
df2 = (df.sort_values('Y1', na_position='last')
.drop_duplicates(['ID', 'X1', 'X2', 'Date1'], keep='first')
.sort_index())
df2
ID X1 X2 Date1 Y1
index
1 2 324 634 2016-01-01 1224.0
3 4 543 843 2017-02-01 654.0
6 5 523 843 2015-09-01 1121.0
7 6 500 897 2015-11-01 NaN
Upvotes: 2
Reputation: 1924
just use drop_duplicates
function https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html
df \
.orderBy(Y1).desc()) \
.drop_duplicates(subset='ID')
Upvotes: 1