Abhishek Gupta
Abhishek Gupta

Reputation: 51

Compare 2 dataframes imported from excel

I am working on data comparison project. Here i have 2 excel files 'Prod1' and 'Proj1' which i had imported in python as data frames. Both the data frames have 46 identical columns with identical labels. But different number of rows across both files (e.g. Prod1 is having 100 while Proj1 is 110). I want to compare cell by cell values and highlight the mismatch as 'TRUE' or 'FALSE'in another dataframe. Which can be further exported in excel. Till now i am able to read the files but got struck in comparison part. Kindly assist

import os

import pandas as pd

import numpy as np

import xlrd

os.getcwd()

os.chdir("C:\\Users\\desktop\\Python Project")

Prod1= pd.read_excel("Prod1.xls")

df1 = pd.read_excel('Prod1.xls', 'Prod1')

print(df1)

df2 = pd.read_excel('Proj1.xls', 'Proj1')

print(df2) 

Upvotes: 0

Views: 381

Answers (2)

Benjamin Rowell
Benjamin Rowell

Reputation: 1411

Assuming df2 is the larger dataframe, as you say in your explanation, resize to match df1:

>>> df1 = df1.reindex(index=df2.index)

Then just print the following:

>>> print(df1 == df2)

Example

# Define two Pandas Dataframes

>>> df1 = pd.DataFrame(data={'col1': [1,2], 'col2': [3,4]})
>>> df2 = pd.DataFrame(data={'col1': [1,2,3], 'col2': [3,4,5]})

>>> df1
   col1  col2
0     1     3
1     2     4

>>> df2
   col1  col2
0     1     3
1     2     4
2     3     5

The Dataframes purposely don't match in size, so if you tried to do the following:

>>> print(df1 == df2)

You would see an error like this:

ValueError: Can only compare identically-labeled DataFrame objects

In order to compare the two, use the index of the bigger Dataframe df2 (this describes the labels applied to each row) and we use it to redefine the index of df1. Note this only works if the two datasets utilise the same indexing.

>>> df1 = df1.reindex(index=df2.index)
>>> df1
   col1  col2
0   1.0   3.0
1   2.0   4.0
2   NaN   NaN

Now if you try to compare them, you will see the following:

>>> print(df1 == df2)
    col1   col2
0   True   True
1   True   True
2  False  False

Upvotes: 1

Anton vBR
Anton vBR

Reputation: 18906

This is the approach you want to have, first let's recreate two dataframes:

import itertools
import pandas as pd
import numpy as np

np.random.seed(1234)

df1 = pd.DataFrame(np.random.randint(0,2,size=(2,2)),columns=list('AB'))
df2 = pd.DataFrame(np.random.randint(0,2,size=(4,2)),columns=list('AB'))

Df1, Df2:

   A  B       A  B
0  1  1    0  0  0       # False
1  0  1    1  0  1       # True
           2  1  1       # False
           3  1  1       # False

Now we compare those dataframes row-wise by zipping the values together and make sure all .all() values are the same.

[(a==b).all() for a,b in itertools.zip_longest(df1.values, df2.values)]

Or without izip_longest:

[(a==b).all() for a,b in zip(df1.values, df2.values)]+[False]*abs(len(df2)-len(df1))

And output is:

[False, True, False, False]

Upvotes: 0

Related Questions