Spedo
Spedo

Reputation: 365

How to remove duplicate rows from a DataFrame, where the duplicates have NaNs?

I have a dataFrame with more than 200 features, and I put a part of the dataset to show the problem:

   index  ID   X1   X2       Date1      Y1     
      0   2   324  634  2016-01-01     NaN    
      1   2   324  634  2016-01-01   1224.0    
      3   4   543  843  2017-02-01     654    
      4   4   543  843  2017-02-01     NaN    
      5   5   523  843  2015-09-01     NaN    
      6   5   523  843  2015-09-01    1121.0
      7   6   500  897  2015-11-01     NaN  

As you can see the rows are duplicated (in ID, X1, X2 and Date1) and I want to remove one of the rows which are similar in ID, X1, X2, Date1 and Y1 which contains NaN. So, my desired DataFrame should be:

   index  ID   X1   X2       Date1      Y1     

      1   2   324  634  2016-01-01   1224.0    
      3   4   543  843  2017-02-01     654    
      6   5   523  843  2015-09-01    1121.0
      7   6   500  897  2015-11-01     NaN 

Does any one know, how I can handle it?

Upvotes: 3

Views: 205

Answers (2)

cs95
cs95

Reputation: 402613

Use sort_values on "Y1" to move NaNs to the bottom of the DataFrame, and then use drop_duplicates:

df2 = (df.sort_values('Y1', na_position='last')
         .drop_duplicates(['ID', 'X1', 'X2', 'Date1'], keep='first')
         .sort_index())

df2
       ID   X1   X2       Date1      Y1
index                                  
1       2  324  634  2016-01-01  1224.0
3       4  543  843  2017-02-01   654.0
6       5  523  843  2015-09-01  1121.0
7       6  500  897  2015-11-01     NaN

Upvotes: 2

frankegoesdown
frankegoesdown

Reputation: 1924

just use drop_duplicates function https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html

df \
.orderBy(Y1).desc()) \
.drop_duplicates(subset='ID')

Upvotes: 1

Related Questions