Sara

Reputation: 97

How to drop duplicate rows based on values of two columns?

I have a data frame like this:

Category Date_1       Score_1    Date_2           Score_2
  A      13/11/2019    5        13/11/2019        10
  A      13/11/2019    5        14/11/2019        55
  A      13/11/2019    5        15/11/2019        45
  A      13/11/2019    5        16/11/2019        80
  A      14/11/2019    3        13/11/2019        10
  A      14/11/2019    3        14/11/2019        55
  A      14/11/2019    3        15/11/2019        45
  A      14/11/2019    3        16/11/2019        80
  A      15/11/2019    7        13/11/2019        10
  A      15/11/2019    7        14/11/2019        55
  A      15/11/2019    7        15/11/2019        45
  A      15/11/2019    7        16/11/2019        80
  B      13/11/2019    4        13/11/2019        18
  B      13/11/2019    4        14/11/2019        65
  B      13/11/2019    4        15/11/2019        75
  B      13/11/2019    4        16/11/2019        89
  B      14/11/2019    9        13/11/2019        18
  B      14/11/2019    9        14/11/2019        65
  B      14/11/2019    9        15/11/2019        75
  B      14/11/2019    9        16/11/2019        89
  B      15/11/2019    8        13/11/2019        18
  B      15/11/2019    8        14/11/2019        65
  B      15/11/2019    8        15/11/2019        75
  B      15/11/2019    8        16/11/2019        89

I want to keep the rows where both dates are the same. I tried this:

df.drop_duplicates(subset=['Date_1', 'Date_2'])

But it does not work. I can't figure out how to drop those extra rows.

Upvotes: 2

Views: 2399

Answers (2)

Stoner

Reputation: 916

You can use duplicated with the subset parameter to specify the columns to check and keep=False to mark all duplicates, then filter with boolean indexing. The following should work:

df = df[df.duplicated(subset=['Date_1', 'Date_2'], keep=False)]

Remark: I initially misread the question as wanting to drop the duplicates, for which the following would work:

df = df.drop_duplicates(subset=['Date_1', 'Date_2'], keep='last')

You should assign the result back to df, as above, for the changes to take effect. Alternatively, add inplace=True to apply the changes without assignment:

df.drop_duplicates(subset=['Date_1', 'Date_2'], keep='last', inplace=True)

If neither this nor the solutions provided by others works, check whether the columns listed in subset have differing data types, for example a difference in datetime formatting.
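
For example, here is a minimal sketch of that check, assuming df is the frame from the question and that the dates use the day-first format shown above:

import pandas as pd

# Inspect the dtypes of the columns used in subset; if one column is a plain
# string (object) and the other is datetime64, visually identical dates will
# not be treated as duplicates.
print(df.dtypes)

# Parse both columns with the same day-first format so they compare consistently.
df['Date_1'] = pd.to_datetime(df['Date_1'], format='%d/%m/%Y')
df['Date_2'] = pd.to_datetime(df['Date_2'], format='%d/%m/%Y')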

Upvotes: 2

jezrael

Reputation: 863701

Use boolean indexing, comparing both columns:

df1 = df[df['Date_1'] == df['Date_2']]

Or DataFrame.query:

df1 = df.query("Date_1 == Date_2")
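
For illustration, a small self-contained sketch using a few rows copied from the sample data in the question:

import pandas as pd

# A handful of rows from the question's data frame.
df = pd.DataFrame({
    'Category': ['A', 'A', 'A', 'B', 'B'],
    'Date_1': ['13/11/2019', '13/11/2019', '14/11/2019', '13/11/2019', '14/11/2019'],
    'Score_1': [5, 5, 3, 4, 9],
    'Date_2': ['13/11/2019', '14/11/2019', '14/11/2019', '13/11/2019', '14/11/2019'],
    'Score_2': [10, 55, 55, 18, 65],
})

# Keep only the rows where both date columns hold the same value.
df1 = df[df['Date_1'] == df['Date_2']]
print(df1)
# Rows where Date_1 != Date_2 (e.g. 13/11/2019 vs 14/11/2019) are dropped.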

Upvotes: 5
