cpx
cpx

Reputation: 17567

Compare DataFrames for Different Rows regardless of row order

I have been trying to compare two dataframes to find missing rows and different rows:

Case 1: Same number of rows and different rows:

In this case, I have same number of rows, but two different rows:

dict_a = {'Values 1':[15, 2, 3, 24, 5, 16], 'Values 2':[10, 7, 3, 5, 6, 23], 
          'Values 3': ["Apple", "Orange", "Kiwi", "Cherry", "Banana", "Grapes"]}
dict_b = {'Values 1':[15, 3, 3, 24, 5, 16], 'Values 2':[10, 7, 3, 5, 6, 23], 
          'Values 3': ["Apple", "Orange", "Kiwi", "Cherry", "Banana", "Grape"]}

df1 = pd.DataFrame(dict_a)
df2 = pd.DataFrame(dict_b)

enter image description here

So, I can find the different row indexes using:

list(df1[~df1.isin(df2)].dropna(how = 'all').index)

This results in [1, 5].

Case 2: Different number of rows and different rows:

In this case, I have different number of rows, and two different rows.

dict_a = {'Values 1':[15, 2, 3, 24, 1], 'Values 2':[10, 7, 3, 5, 6], 
          'Values 3': ["Apple", "Orange", "Kiwi", "Cherry", "Banana"]}
dict_b = {'Values 1':[15, 1, 3, 24, 5, 16], 'Values 2':[10, 7, 3, 5, 6, 23], 
          'Values 3': ["Apple", "Orange", "Kiwi", "Cherry", "Banana", "Grape"]}

df1 = pd.DataFrame(dict_a)
df2 = pd.DataFrame(dict_b)

display(df1)
display(df2)

enter image description here

Here, first I check for missing rows. From this, I can find that it is the fifth index by comparing the dataframe with greater length with shorter length:

df2[~df2.index.isin(df1.index)]

I can also find the different rows:

df1[~df1.isin(df2)].dropna(how = 'all')

which are [1, 4].

Case 3: Interchanged rows, but same row data

dict_a = {'Values 1':[2, 15, 3, 24, 5, 16], 'Values 2':[7, 10, 3, 5, 6, 23], 
          'Values 3': ["Orange", "Apple", "Kiwi", "Cherry", "Banana", "Grape"]}
dict_b = {'Values 1':[15, 2, 3, 24, 5, 16], 'Values 2':[10, 7, 3, 5, 6, 23], 
          'Values 3': ["Apple", "Orange", "Kiwi", "Cherry", "Banana", "Grape"]}

df1 = pd.DataFrame(dict_a)
df2 = pd.DataFrame(dict_b)

display(df1)
display(df2)

However, the code doesn't work for interchanged rows. It needs to check if row already exists regardless of position and return a result "there are no different rows", but instead it says rows at index [0, 1] are different.

enter image description here

So, I am looking for a solution which can compare the two dataframes regardless of row order or position, missing rows, and different rows.

So, my goal is to compare two dataframes and return "missing rows" and also "different rows" i.e. which don't completely match each other. Is there a such a function which can actually compare the rows in this way?

Upvotes: 4

Views: 2121

Answers (3)

Mortz
Mortz

Reputation: 4879

You could use sets -

s1 = set(df1.to_string(index=False, header=False).split('\n'))
s2 = set(df2.to_string(index=False, header=False).split('\n'))
for item in s1.union(s2):
    if item in s1.intersection(s2):
        print(f'{item}|both')
    elif item in s1:
        print(f'{item}|left_only')
    else:
        print(f'{item}|right_only')
#  1   7  Orange|right_only
#  5   6  Banana|right_only
# 24   5  Cherry|both
#  2   7  Orange|left_only
# 15  10   Apple|both
# 16  23   Grape|right_only
#  3   3    Kiwi|both
#  1   6  Banana|left_only

Upvotes: 1

Paul Brennan
Paul Brennan

Reputation: 2696

Here is a great trick that I learned by accident pd.merge does a great job showing this

dict_a = {'Values 1':[2, 15, 3, 24, 5, 16], 'Values 2':[7, 10, 3, 5, 6, 23], 
          'Values 3': ["Orange", "Apple", "Kiwi", "Cherry", "Banana", "Grape"]}
dict_b = {'Values 1':[15, 2, 3, 24, 5, 16], 'Values 2':[10, 7, 3, 5, 6, 23], 
          'Values 3': ["Apple", "Orange", "Kiwi", "Cherry", "Banana", "Grape"]}

df1 = pd.DataFrame(dict_a)
df2 = pd.DataFrame(dict_b)

df3 = pd.merge(df1,df2,how='outer',on=['Values 1','Values 2','Values 3'],indicator=True)
print(df3)

is your last case you can see that this solves the problem with order

   Values 1  Values 2 Values 3 _merge
0         2         7   Orange   both
1        15        10    Apple   both
2         3         3     Kiwi   both
3        24         5   Cherry   both
4         5         6   Banana   both
5        16        23    Grape   both

Now go back to an earlier case

dict_c = {'Values 1':[15, 2, 3, 24, 5, 16], 'Values 2':[10, 7, 3, 5, 6, 23], 
          'Values 3': ["Apple", "Orange", "Kiwi", "Cherry", "Banana", "Nut"]}
df4 = pd.DataFrame(dict_c)
df5 = pd.merge(df1,df4,how='outer', on=['Values 1','Values 2','Values 3'],indicator=True)
print(df5)

You get

   Values 1  Values 2 Values 3      _merge
0         2         7   Orange        both
1        15        10    Apple        both
2         3         3     Kiwi        both
3        24         5   Cherry        both
4         5         6   Banana        both
5        16        23    Grape   left_only
6        16        23      Nut  right_only

The indicators give you where they are different. Using the on variable also makes it so that you can use a subset of the columns to fix your problem.

Upvotes: 2

Sam
Sam

Reputation: 467

For the missing rows, just use drop duplicates together with a concat

import pandas as pd
dict_a = {'Values 1':[2, 15, 3, 24, 5, 16, 2], 'Values 2':[7, 10, 3, 5, 6, 23,1], 
          'Values 3': ["Orange", "Apple", "Kiwi", "Cherry", "Banana", "Grape", "test"]}
dict_b = {'Values 1':[15, 2, 3, 24, 5, 16, 1], 'Values 2':[10, 7, 3, 5, 6, 23, 0], 
          'Values 3': ["Apple", "Orange", "Kiwi", "Cherry", "Banana", "Grape", "test"]}

df1 = pd.DataFrame(dict_a)
df2 = pd.DataFrame(dict_b)
pd.concat([df1, df2]).drop_duplicates(keep=False)

enter image description here

For the different rows I suggest the following logic. You compare the entire dataframe and check where the sum of the row is not equal to 3.

df1[(df1 == df2).apply(sum, axis = 1) != 3] 

enter image description here

df2[(df1 == df2).apply(sum, axis = 1) != 3] 

enter image description here

Upvotes: 0

Related Questions