Reputation: 983
I have read miscellaneous posts with a similar question but couldn't find exactly this question. I have two pandas DataFrames that I want to merge. They have timestamps as indexes. The 2nd Dataframe basically overlaps the 1st and they thus both share rows with same timestamps and values.
I would like to remove these rows because they share everything: index and values in columns. If they don't share both index and values in columns, I want to keep them.
So far, I could point out:
Index.drop_duplicate: this is not what I am looking for. It doesn't check values in columns are the same. And I want to keep rows with same timestamps but different values in columns
DataFrame.drop_duplicate: well, same as above, it doesn't check index value, and if rows are found with same values in column but different indexes, I want to keep them.
To give an example, I am re-using the data given in below answer.
df1
Value
2012-02-01 12:00:00 10
2012-02-01 12:30:00 10
2012-02-01 13:00:00 20
2012-02-01 13:30:00 30
df2
Value
2012-02-01 12:30:00 20
2012-02-01 13:00:00 20
2012-02-01 13:30:00 30
2012-02-02 14:00:00 10
Result I would like to obtain is the following one:
Value
2012-02-01 12:00:00 10 #(from df1)
2012-02-01 12:30:00 10 #(from df1)
2012-02-01 12:30:00 20 #(from df2 - same index than in df1, but different value)
2012-02-01 13:00:00 20 #(in df1 & df2, only one kept)
2012-02-01 13:30:00 30 #(in df1 & df2, only one kept)
2012-02-02 14:00:00 10 #(from df2)
Please, any idea? Thanks for your help! Bests
Upvotes: 2
Views: 5611
Reputation: 11
Just improving the first answer, insert Date inside drop_duplicates
pd.concat([df, df2]).sort_values('Date')\
.drop_duplicates('Date').reset_index(drop=True)
Upvotes: 1
Reputation: 30971
Assume that you have 2 following DataFrames:
df:
Date Value
0 2012-02-01 12:00:00 10
1 2012-02-01 12:30:00 10
2 2012-02-01 13:00:00 20
3 2012-02-01 13:30:00 30
4 2012-02-02 14:00:00 10
5 2012-02-02 14:30:00 10
6 2012-02-02 15:00:00 20
7 2012-02-02 15:30:00 30
df2:
Date Value
0 2012-02-01 12:00:00 10
1 2012-02-01 12:30:00 21
2 2012-02-01 12:40:00 22
3 2012-02-01 13:00:00 20
4 2012-02-01 13:30:00 30
To generate the result, run:
pd.concat([df, df2]).sort_values('Date')\
.drop_duplicates().reset_index(drop=True)
The result, for the above data, is:
Date Value
0 2012-02-01 12:00:00 10
1 2012-02-01 12:30:00 10
2 2012-02-01 12:30:00 21
3 2012-02-01 12:40:00 22
4 2012-02-01 13:00:00 20
5 2012-02-01 13:30:00 30
6 2012-02-02 14:00:00 10
7 2012-02-02 14:30:00 10
8 2012-02-02 15:00:00 20
9 2012-02-02 15:30:00 30
drop_duplicates
drops duplicated rows, keeping the first.
Since no subset
parameter has been passed, the criterion to treat
2 rows as duplicates is identity of all columns.
Upvotes: 3