finethen
finethen

Reputation: 433

Python, Pandas: Compare two dataframes and return combined

Good evening,

I would like to know, which is the best way to compare two dataframes and return a combination of them? Or if there's even a build-in function inside pandas?

For example, these are my two dataframes:

Dataframe 01:

first_name | age | id | value_a | value_b | value_c
peter      | 37  | 19 | 4562    | 78      | 21.5
jane       | 32  | 5  | 3832    | 85      | 17.0
michael    | 43  | 41 | 2195    | 63      | 44.4

Dataframe 02:

first_name | age | id | value_a | value_b | value_c
sarah      | 51  | 2  | 63      | 81      | 4.1
peter      | 37  | 19 | 4562    | 81      | 21.5
tom        | 22  | 89 | 107     | 14      | 0.0
michael    | 43  | 41 | 1838    | 63      | 44.4

As you can see, there are some new entrys troughout the whole dataframe (Dataframe 02) and some of the already existing ones are also listed --> some changes were made in these rows! What I want to achieve is a new(?) dataframe that contains all the new rows, the already existing ones and those who got updated! In this case:

Dataframe New

first_name | age | id | value_a | value_b | value_c
peter      | 37  | 19 | 4562    | 81      | 21.5
jane       | 32  | 5  | 3832    | 85      | 17.0
michael    | 43  | 41 | 1838    | 63      | 44.4
sarah      | 51  | 2  | 63      | 81      | 4.1
tom        | 22  | 89 | 107     | 14      | 0.0

Notes:

Thanks for all your help and a great evening!

Upvotes: 3

Views: 4224

Answers (2)

Shubham Sharma
Shubham Sharma

Reputation: 71689

Since you asked if there's even a built-in function inside pandas?. The answer is yes, there is a built in function in pandas that allows you to compare identically labelled (with same index and columns) dataframe's.

There is a DataFrame.compare function which is available in pandas version >= 1.1.0 and allows you to compare first dataframe to second DataFrame and show the differences:

So, now lets look at your example as you said

  • there is always a column (here: 'id') that can be seen as a non changing key
  • the amount and names of the colums are always staying the same

Therefore in order to compare the two dataframes you first need to align both the dataframe this can be done using DataFrame.align function after setting the column id as common index in both the dataframes:

d1, d2 = df1.set_index('id').align(df2.set_index('id'))

Now you can use DataFrame.compare on aligned dataframes:

d1.compare(d2, keep_equal=True)

which results:

         first_name     age           value_a         value_b       value_c      
         self    other  self other    self   other    self other    self other
id                                                                            
2         NaN    sarah   NaN  51.0     NaN    63.0     NaN  81.0     NaN   4.1
5        jane      NaN  32.0   NaN  3832.0     NaN    85.0   NaN    17.0   NaN
19      peter    peter  37.0  37.0  4562.0  4562.0    78.0  81.0    21.5  21.5
41    michael  michael  43.0  43.0  2195.0  1838.0    63.0  63.0    44.4  44.4
89        NaN      tom   NaN  22.0     NaN   107.0     NaN  14.0     NaN   0.0

Now coming to your second question:

How to achieve is a new(?) dataframe that contains all the new rows, the already existing ones and those who got updated!

You can use DataFrame.comine_first on the aligned dataframes d1 and d2:

d2.combine_first(d1)

Or, on the non aligned ones as follows:

df2.set_index('id').combine_first(df1.set_index('id'))

which results:

   first_name   age  value_a  value_b  value_c
id                                            
2       sarah  51.0     63.0     81.0      4.1
5        jane  32.0   3832.0     85.0     17.0
19      peter  37.0   4562.0     81.0     21.5
41    michael  43.0   1838.0     63.0     44.4
89        tom  22.0    107.0     14.0      0.0

Upvotes: 4

Danail Petrov
Danail Petrov

Reputation: 1875

Here is one way of doing it

>>> (pd.concat([df1, df2])
       .drop_duplicates(subset=['id','first_name'], keep='last')
       .reset_index(drop=True)
       .set_index('first_name')
     )

first_name  |  age   |  id   |   value_a  |  value_b   |   value_c
jane        |   32   |   5   |     3832   |       85   |     17.0
sarah       |   51   |   2   |       63   |       81   |      4.1
peter       |   37   |  19   |     4562   |       81   |     21.5
tom         |   22   |  89   |      107   |       14   |      0.0
michael     |   43   |  41   |     1838   |       63   |     44.4

Upvotes: 2

Related Questions