Reputation: 51
I am working on data comparison project. Here i have 2 excel files 'Prod1' and 'Proj1' which i had imported in python as data frames. Both the data frames have 46 identical columns with identical labels. But different number of rows across both files (e.g. Prod1 is having 100 while Proj1 is 110). I want to compare cell by cell values and highlight the mismatch as 'TRUE' or 'FALSE'in another dataframe. Which can be further exported in excel. Till now i am able to read the files but got struck in comparison part. Kindly assist
import os
import pandas as pd
import numpy as np
import xlrd
os.getcwd()
os.chdir("C:\\Users\\desktop\\Python Project")
Prod1= pd.read_excel("Prod1.xls")
df1 = pd.read_excel('Prod1.xls', 'Prod1')
print(df1)
df2 = pd.read_excel('Proj1.xls', 'Proj1')
print(df2)
Upvotes: 0
Views: 381
Reputation: 1411
Assuming df2 is the larger dataframe, as you say in your explanation, resize to match df1:
>>> df1 = df1.reindex(index=df2.index)
Then just print the following:
>>> print(df1 == df2)
Example
# Define two Pandas Dataframes
>>> df1 = pd.DataFrame(data={'col1': [1,2], 'col2': [3,4]})
>>> df2 = pd.DataFrame(data={'col1': [1,2,3], 'col2': [3,4,5]})
>>> df1
col1 col2
0 1 3
1 2 4
>>> df2
col1 col2
0 1 3
1 2 4
2 3 5
The Dataframes purposely don't match in size, so if you tried to do the following:
>>> print(df1 == df2)
You would see an error like this:
ValueError: Can only compare identically-labeled DataFrame objects
In order to compare the two, use the index of the bigger Dataframe df2 (this describes the labels applied to each row) and we use it to redefine the index of df1. Note this only works if the two datasets utilise the same indexing.
>>> df1 = df1.reindex(index=df2.index)
>>> df1
col1 col2
0 1.0 3.0
1 2.0 4.0
2 NaN NaN
Now if you try to compare them, you will see the following:
>>> print(df1 == df2)
col1 col2
0 True True
1 True True
2 False False
Upvotes: 1
Reputation: 18906
This is the approach you want to have, first let's recreate two dataframes:
import itertools
import pandas as pd
import numpy as np
np.random.seed(1234)
df1 = pd.DataFrame(np.random.randint(0,2,size=(2,2)),columns=list('AB'))
df2 = pd.DataFrame(np.random.randint(0,2,size=(4,2)),columns=list('AB'))
Df1, Df2:
A B A B
0 1 1 0 0 0 # False
1 0 1 1 0 1 # True
2 1 1 # False
3 1 1 # False
Now we compare those dataframes row-wise by zipping the values together and make sure all .all()
values are the same.
[(a==b).all() for a,b in itertools.zip_longest(df1.values, df2.values)]
Or without izip_longest:
[(a==b).all() for a,b in zip(df1.values, df2.values)]+[False]*abs(len(df2)-len(df1))
And output is:
[False, True, False, False]
Upvotes: 0