Reputation: 17567
I have been trying to compare two dataframes to find missing rows and different rows:
In this case, I have same number of rows, but two different rows:
dict_a = {'Values 1':[15, 2, 3, 24, 5, 16], 'Values 2':[10, 7, 3, 5, 6, 23],
'Values 3': ["Apple", "Orange", "Kiwi", "Cherry", "Banana", "Grapes"]}
dict_b = {'Values 1':[15, 3, 3, 24, 5, 16], 'Values 2':[10, 7, 3, 5, 6, 23],
'Values 3': ["Apple", "Orange", "Kiwi", "Cherry", "Banana", "Grape"]}
df1 = pd.DataFrame(dict_a)
df2 = pd.DataFrame(dict_b)
So, I can find the different row indexes using:
list(df1[~df1.isin(df2)].dropna(how = 'all').index)
This results in [1, 5]
.
In this case, I have different number of rows, and two different rows.
dict_a = {'Values 1':[15, 2, 3, 24, 1], 'Values 2':[10, 7, 3, 5, 6],
'Values 3': ["Apple", "Orange", "Kiwi", "Cherry", "Banana"]}
dict_b = {'Values 1':[15, 1, 3, 24, 5, 16], 'Values 2':[10, 7, 3, 5, 6, 23],
'Values 3': ["Apple", "Orange", "Kiwi", "Cherry", "Banana", "Grape"]}
df1 = pd.DataFrame(dict_a)
df2 = pd.DataFrame(dict_b)
display(df1)
display(df2)
Here, first I check for missing rows. From this, I can find that it is the fifth index by comparing the dataframe with greater length with shorter length:
df2[~df2.index.isin(df1.index)]
I can also find the different rows:
df1[~df1.isin(df2)].dropna(how = 'all')
which are [1, 4]
.
dict_a = {'Values 1':[2, 15, 3, 24, 5, 16], 'Values 2':[7, 10, 3, 5, 6, 23],
'Values 3': ["Orange", "Apple", "Kiwi", "Cherry", "Banana", "Grape"]}
dict_b = {'Values 1':[15, 2, 3, 24, 5, 16], 'Values 2':[10, 7, 3, 5, 6, 23],
'Values 3': ["Apple", "Orange", "Kiwi", "Cherry", "Banana", "Grape"]}
df1 = pd.DataFrame(dict_a)
df2 = pd.DataFrame(dict_b)
display(df1)
display(df2)
However, the code doesn't work for interchanged rows. It needs to check if row already exists regardless of position and return a result "there are no different rows", but instead it says rows at index [0, 1]
are different.
So, I am looking for a solution which can compare the two dataframes regardless of row order or position, missing rows, and different rows.
So, my goal is to compare two dataframes and return "missing rows" and also "different rows" i.e. which don't completely match each other. Is there a such a function which can actually compare the rows in this way?
Upvotes: 4
Views: 2121
Reputation: 4879
You could use sets -
s1 = set(df1.to_string(index=False, header=False).split('\n'))
s2 = set(df2.to_string(index=False, header=False).split('\n'))
for item in s1.union(s2):
if item in s1.intersection(s2):
print(f'{item}|both')
elif item in s1:
print(f'{item}|left_only')
else:
print(f'{item}|right_only')
# 1 7 Orange|right_only
# 5 6 Banana|right_only
# 24 5 Cherry|both
# 2 7 Orange|left_only
# 15 10 Apple|both
# 16 23 Grape|right_only
# 3 3 Kiwi|both
# 1 6 Banana|left_only
Upvotes: 1
Reputation: 2696
Here is a great trick that I learned by accident pd.merge does a great job showing this
dict_a = {'Values 1':[2, 15, 3, 24, 5, 16], 'Values 2':[7, 10, 3, 5, 6, 23],
'Values 3': ["Orange", "Apple", "Kiwi", "Cherry", "Banana", "Grape"]}
dict_b = {'Values 1':[15, 2, 3, 24, 5, 16], 'Values 2':[10, 7, 3, 5, 6, 23],
'Values 3': ["Apple", "Orange", "Kiwi", "Cherry", "Banana", "Grape"]}
df1 = pd.DataFrame(dict_a)
df2 = pd.DataFrame(dict_b)
df3 = pd.merge(df1,df2,how='outer',on=['Values 1','Values 2','Values 3'],indicator=True)
print(df3)
is your last case you can see that this solves the problem with order
Values 1 Values 2 Values 3 _merge
0 2 7 Orange both
1 15 10 Apple both
2 3 3 Kiwi both
3 24 5 Cherry both
4 5 6 Banana both
5 16 23 Grape both
Now go back to an earlier case
dict_c = {'Values 1':[15, 2, 3, 24, 5, 16], 'Values 2':[10, 7, 3, 5, 6, 23],
'Values 3': ["Apple", "Orange", "Kiwi", "Cherry", "Banana", "Nut"]}
df4 = pd.DataFrame(dict_c)
df5 = pd.merge(df1,df4,how='outer', on=['Values 1','Values 2','Values 3'],indicator=True)
print(df5)
You get
Values 1 Values 2 Values 3 _merge
0 2 7 Orange both
1 15 10 Apple both
2 3 3 Kiwi both
3 24 5 Cherry both
4 5 6 Banana both
5 16 23 Grape left_only
6 16 23 Nut right_only
The indicators give you where they are different. Using the on variable also makes it so that you can use a subset of the columns to fix your problem.
Upvotes: 2
Reputation: 467
For the missing rows, just use drop duplicates together with a concat
import pandas as pd
dict_a = {'Values 1':[2, 15, 3, 24, 5, 16, 2], 'Values 2':[7, 10, 3, 5, 6, 23,1],
'Values 3': ["Orange", "Apple", "Kiwi", "Cherry", "Banana", "Grape", "test"]}
dict_b = {'Values 1':[15, 2, 3, 24, 5, 16, 1], 'Values 2':[10, 7, 3, 5, 6, 23, 0],
'Values 3': ["Apple", "Orange", "Kiwi", "Cherry", "Banana", "Grape", "test"]}
df1 = pd.DataFrame(dict_a)
df2 = pd.DataFrame(dict_b)
pd.concat([df1, df2]).drop_duplicates(keep=False)
For the different rows I suggest the following logic. You compare the entire dataframe and check where the sum of the row is not equal to 3.
df1[(df1 == df2).apply(sum, axis = 1) != 3]
df2[(df1 == df2).apply(sum, axis = 1) != 3]
Upvotes: 0