Reputation: 433
Good evening,
I would like to know, which is the best way to compare two dataframes and return a combination of them? Or if there's even a build-in function inside pandas?
For example, these are my two dataframes:
Dataframe 01:
first_name | age | id | value_a | value_b | value_c
peter | 37 | 19 | 4562 | 78 | 21.5
jane | 32 | 5 | 3832 | 85 | 17.0
michael | 43 | 41 | 2195 | 63 | 44.4
Dataframe 02:
first_name | age | id | value_a | value_b | value_c
sarah | 51 | 2 | 63 | 81 | 4.1
peter | 37 | 19 | 4562 | 81 | 21.5
tom | 22 | 89 | 107 | 14 | 0.0
michael | 43 | 41 | 1838 | 63 | 44.4
As you can see, there are some new entrys troughout the whole dataframe (Dataframe 02) and some of the already existing ones are also listed --> some changes were made in these rows! What I want to achieve is a new(?) dataframe that contains all the new rows, the already existing ones and those who got updated! In this case:
Dataframe New
first_name | age | id | value_a | value_b | value_c
peter | 37 | 19 | 4562 | 81 | 21.5
jane | 32 | 5 | 3832 | 85 | 17.0
michael | 43 | 41 | 1838 | 63 | 44.4
sarah | 51 | 2 | 63 | 81 | 4.1
tom | 22 | 89 | 107 | 14 | 0.0
Notes:
Thanks for all your help and a great evening!
Upvotes: 3
Views: 4224
Reputation: 71689
Since you asked if there's even a built-in function inside pandas?. The answer is yes, there is a built in function in pandas that allows you to compare identically labelled (with same index and columns) dataframe's.
There is a DataFrame.compare
function which is available in pandas version >= 1.1.0
and allows you to compare first dataframe to second DataFrame and show the differences:
So, now lets look at your example as you said
- there is always a column (here: 'id') that can be seen as a non changing key
- the amount and names of the colums are always staying the same
Therefore in order to compare the two dataframes you first need to align
both the dataframe this can be done using DataFrame.align
function after setting the column id
as common index
in both the dataframes:
d1, d2 = df1.set_index('id').align(df2.set_index('id'))
Now you can use DataFrame.compare
on aligned dataframes:
d1.compare(d2, keep_equal=True)
which results:
first_name age value_a value_b value_c
self other self other self other self other self other
id
2 NaN sarah NaN 51.0 NaN 63.0 NaN 81.0 NaN 4.1
5 jane NaN 32.0 NaN 3832.0 NaN 85.0 NaN 17.0 NaN
19 peter peter 37.0 37.0 4562.0 4562.0 78.0 81.0 21.5 21.5
41 michael michael 43.0 43.0 2195.0 1838.0 63.0 63.0 44.4 44.4
89 NaN tom NaN 22.0 NaN 107.0 NaN 14.0 NaN 0.0
Now coming to your second question:
How to achieve is a new(?) dataframe that contains all the new rows, the already existing ones and those who got updated!
You can use DataFrame.comine_first
on the aligned dataframes d1
and d2
:
d2.combine_first(d1)
Or, on the non aligned ones as follows:
df2.set_index('id').combine_first(df1.set_index('id'))
which results:
first_name age value_a value_b value_c
id
2 sarah 51.0 63.0 81.0 4.1
5 jane 32.0 3832.0 85.0 17.0
19 peter 37.0 4562.0 81.0 21.5
41 michael 43.0 1838.0 63.0 44.4
89 tom 22.0 107.0 14.0 0.0
Upvotes: 4
Reputation: 1875
Here is one way of doing it
>>> (pd.concat([df1, df2])
.drop_duplicates(subset=['id','first_name'], keep='last')
.reset_index(drop=True)
.set_index('first_name')
)
first_name | age | id | value_a | value_b | value_c
jane | 32 | 5 | 3832 | 85 | 17.0
sarah | 51 | 2 | 63 | 81 | 4.1
peter | 37 | 19 | 4562 | 81 | 21.5
tom | 22 | 89 | 107 | 14 | 0.0
michael | 43 | 41 | 1838 | 63 | 44.4
Upvotes: 2