Reputation: 1314
I have two DataFrames whose rows share corresponding index values, and I want to merge them. Every row has an update time. For rows with identical index values, the row with the higher update time wins: all fields should be taken from the ‘newer’ row, except for fields where only the ‘older’ row has a value. Example:
import pandas as pd

# Older data: every row gets update_time 1
df1 = pd.DataFrame({'Hugo'   : {'age' : 21, 'weight' : 75},
                    'Niklas' : {'age' : 46, 'weight' : 65},
                    'Ronald' : {'age' : 76, 'weight' : 85, 'height' : 176}}).T
df1.index.names = ['name']
df1['update_time'] = 1

# Newer data: every row gets update_time 2
df2 = pd.DataFrame({'Hugo'    : {'age' : 22, 'weight' : 77},
                    'Bertram' : {'age' : 45, 'weight' : 65, 'height' : 190},
                    'Donald'  : {'age' : 75, 'weight' : 85},
                    'Ronald'  : {'age' : 77, 'weight' : 84}}).T
df2.index.names = ['name']
df2['update_time'] = 2
df1:
+--------+-------+----------+----------+---------------+
| name   |   age |   height |   weight |   update_time |
|--------+-------+----------+----------+---------------|
| Hugo   |    21 |      nan |       75 |             1 |
| Niklas |    46 |      nan |       65 |             1 |
| Ronald |    76 |      176 |       85 |             1 |
+--------+-------+----------+----------+---------------+
df2:
+---------+-------+----------+----------+---------------+
| name    |   age |   height |   weight |   update_time |
|---------+-------+----------+----------+---------------|
| Bertram |    45 |      190 |       65 |             2 |
| Donald  |    75 |      nan |       85 |             2 |
| Hugo    |    22 |      nan |       77 |             2 |
| Ronald  |    77 |      nan |       84 |             2 |
+---------+-------+----------+----------+---------------+
Result should look like this:
+---------+-------+----------+----------+---------------+
| name    |   age |   height |   weight |   update_time |
|---------+-------+----------+----------+---------------|
| Niklas  |    46 |      nan |       65 |             1 |
| Bertram |    45 |      190 |       65 |             2 |
| Donald  |    75 |      nan |       85 |             2 |
| Hugo    |    22 |      nan |       77 |             2 |
| Ronald  |    77 |      176 |       84 |             2 |
+---------+-------+----------+----------+---------------+
How can I do this? The problem is keeping Ronald's height. If I first call df.update on df1, the old timestamp is gone and I can no longer find the older duplicates. If I use df.append, the fields are not merged.
Upvotes: 1
Views: 382
Reputation: 166
I am adding my answer here for completeness, as another working solution that I have mostly used:
rdpz = lambda df: df.groupby(df.index).last()
df3 = rdpz( pd.concat([df1, df2]) )
Here, the function rdpz groups the concatenated dataframes by their index and takes the last non-null entry of each column in every group, which is why Ronald's height survives (first can be substituted for last if the list order is reversed).
This is not nearly as clean as Scott Boston's combine_first answer, but it is helpful if you want to extend to more than two dataframes or have trickier MultiIndexes. In one line, you can use:
df3 = pd.concat([df1, df2]).pipe(lambda df: df.groupby(df.index).last())
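The groupby approach above relies on the frames being concatenated oldest first. As a minimal sketch of how it might be extended so that the order of the list no longer matters, the rows can be sorted by update_time before grouping. The helper name merge_latest and the from_dict construction of the sample frames are assumptions made for this illustration, not part of the original answer.

```python
import pandas as pd

# Sample frames mirroring the question's data (built with from_dict for brevity).
df1 = pd.DataFrame.from_dict(
    {"Hugo": {"age": 21, "weight": 75},
     "Niklas": {"age": 46, "weight": 65},
     "Ronald": {"age": 76, "weight": 85, "height": 176}},
    orient="index",
)
df1["update_time"] = 1

df2 = pd.DataFrame.from_dict(
    {"Hugo": {"age": 22, "weight": 77},
     "Bertram": {"age": 45, "weight": 65, "height": 190},
     "Donald": {"age": 75, "weight": 85},
     "Ronald": {"age": 77, "weight": 84}},
    orient="index",
)
df2["update_time"] = 2


def merge_latest(frames):
    """Hypothetical helper: merge any number of frames, newest value wins."""
    stacked = pd.concat(frames)
    # Stable sort so the newest row of each index value ends up last.
    stacked = stacked.sort_values("update_time", kind="mergesort")
    # last() takes the last non-null value per column within each group,
    # so gaps in newer rows are filled from older rows.
    return stacked.groupby(level=0).last()


# The frames are passed newest first on purpose; sorting fixes the order.
merged = merge_latest([df2, df1])
print(merged)
```

One caveat of this design: because last() skips nulls, a field that was deliberately cleared in the newer row would be refilled from the older row.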
Upvotes: 0
Reputation: 153520
Use combine_first:
df2.combine_first(df1)
output:
          age  height  weight  update_time
name
Bertram  45.0   190.0    65.0          2.0
Donald   75.0     NaN    85.0          2.0
Hugo     22.0     NaN    77.0          2.0
Niklas   46.0     NaN    65.0          1.0
Ronald   77.0   176.0    84.0          2.0
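To spell out what happens here: combine_first keeps every non-null value of the calling frame (df2, the newer one) and fills the remaining gaps from the argument (df1). The following sketch rebuilds the question's frames and, as an optional extra step that is not part of the original answer, casts the columns without missing values back to integers, since the output above shows them as floats. The variable names result and tidy are chosen for this illustration.

```python
import pandas as pd

# Frames built as in the question.
df1 = pd.DataFrame({"Hugo": {"age": 21, "weight": 75},
                    "Niklas": {"age": 46, "weight": 65},
                    "Ronald": {"age": 76, "weight": 85, "height": 176}}).T
df1.index.names = ["name"]
df1["update_time"] = 1

df2 = pd.DataFrame({"Hugo": {"age": 22, "weight": 77},
                    "Bertram": {"age": 45, "weight": 65, "height": 190},
                    "Donald": {"age": 75, "weight": 85},
                    "Ronald": {"age": 77, "weight": 84}}).T
df2.index.names = ["name"]
df2["update_time"] = 2

# Values from df2 win; only cells that are null in df2 are taken from df1.
result = df2.combine_first(df1)

# Optional: age, weight and update_time have no missing values,
# so they can safely be cast back to integers. height keeps its NaNs.
tidy = result.astype({"age": "int64", "weight": "int64", "update_time": "int64"})
print(tidy)
```

Note that this only works as intended when the calling frame is known to be the newer one for every shared row; combine_first itself never looks at update_time.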
Upvotes: 6