pierre_j
pierre_j

Reputation: 983

Merging two dataframes and removing duplicate rows WITH duplicate indexes (pandas)

I have read miscellaneous posts with a similar question but couldn't find exactly this question. I have two pandas DataFrames that I want to merge. They have timestamps as indexes. The 2nd Dataframe basically overlaps the 1st and they thus both share rows with same timestamps and values.

I would like to remove these rows because they share everything: index and values in columns. If they don't share both index and values in columns, I want to keep them.

So far, I could point out:

Index.drop_duplicate: this is not what I am looking for. It doesn't check values in columns are the same. And I want to keep rows with same timestamps but different values in columns

DataFrame.drop_duplicate: well, same as above, it doesn't check index value, and if rows are found with same values in column but different indexes, I want to keep them.

To give an example, I am re-using the data given in below answer.

df1

                        Value
 2012-02-01 12:00:00     10
 2012-02-01 12:30:00     10
 2012-02-01 13:00:00     20
 2012-02-01 13:30:00     30

df2

                        Value
 2012-02-01 12:30:00     20
 2012-02-01 13:00:00     20
 2012-02-01 13:30:00     30
 2012-02-02 14:00:00     10

Result I would like to obtain is the following one:

                        Value
 2012-02-01 12:00:00     10  #(from df1)
 2012-02-01 12:30:00     10  #(from df1)
 2012-02-01 12:30:00     20  #(from df2 - same index than in df1, but different value)
 2012-02-01 13:00:00     20  #(in df1 & df2, only one kept)
 2012-02-01 13:30:00     30  #(in df1 & df2, only one kept)
 2012-02-02 14:00:00     10  #(from df2)

Please, any idea? Thanks for your help! Bests

Upvotes: 2

Views: 5611

Answers (2)

Monica
Monica

Reputation: 11

Just improving the first answer, insert Date inside drop_duplicates

pd.concat([df, df2]).sort_values('Date')\
.drop_duplicates('Date').reset_index(drop=True)

Upvotes: 1

Valdi_Bo
Valdi_Bo

Reputation: 30971

Assume that you have 2 following DataFrames:

  1. df:

                     Date  Value
    0 2012-02-01 12:00:00     10
    1 2012-02-01 12:30:00     10
    2 2012-02-01 13:00:00     20
    3 2012-02-01 13:30:00     30
    4 2012-02-02 14:00:00     10
    5 2012-02-02 14:30:00     10
    6 2012-02-02 15:00:00     20
    7 2012-02-02 15:30:00     30
    
  2. df2:

                     Date  Value
    0 2012-02-01 12:00:00     10
    1 2012-02-01 12:30:00     21
    2 2012-02-01 12:40:00     22
    3 2012-02-01 13:00:00     20
    4 2012-02-01 13:30:00     30
    

To generate the result, run:

pd.concat([df, df2]).sort_values('Date')\
    .drop_duplicates().reset_index(drop=True)

The result, for the above data, is:

                 Date  Value
0 2012-02-01 12:00:00     10
1 2012-02-01 12:30:00     10
2 2012-02-01 12:30:00     21
3 2012-02-01 12:40:00     22
4 2012-02-01 13:00:00     20
5 2012-02-01 13:30:00     30
6 2012-02-02 14:00:00     10
7 2012-02-02 14:30:00     10
8 2012-02-02 15:00:00     20
9 2012-02-02 15:30:00     30

drop_duplicates drops duplicated rows, keeping the first. Since no subset parameter has been passed, the criterion to treat 2 rows as duplicates is identity of all columns.

Upvotes: 3

Related Questions