Reputation: 29
We have 2 excel files one having 7.5k records and other having 7k records. We need to compare the data by keeping one specific column as fixed from one sheet to compare with another sheet.
For Example sheet1:
**Emp_ID| Name| Phone| Address**
-------------------------------------
1 | A | 123 | ABC
-------------------------------------
2 | B | 456 | CBD
-------------------------------------
3 | C | 789 | S
For Example sheet2:
**Emp_ID| Name| Phone| Address**
-------------------------------------
1 | A | 123 | ABC
-------------------------------------
3 | C | 789 | S
Python Comparison should on the basis of Emp_ID and Emp_ID=2 should be output as missing when passing the argument as Emp_ID while executing the python script. I am trying the same by using XLRD module, but its comparing only cell by cell instead of freezing one column and then comparing the row with other excel file.
def compareexcel(oldSheet, newSheet):
rowb2 = xlrd.open_workbook(oldSheet)
rowb1 = xlrd.open_workbook(newSheet)
sheet1 = rowb1.sheet_by_index(0)
sheet2 = rowb2.sheet_by_index(0)
for rownum in range(max(sheet1.nrows, sheet2.nrows)):
if rownum < sheet1.nrows:
row_rb1 = sheet1.row_values(rownum)
row_rb2 = sheet2.row_values(rownum)
for colnum, (c1, c2) in enumerate(izip_longest(row_rb1, row_rb2)):
if c1 != c2:
print "Row {} Col {} - {} != {}".format(rownum+1, colnum+1, c1, c2)
Upvotes: 0
Views: 594
Reputation: 93
I have written one function to search for a column value into another sheet and on the basis of that comparison would take place in compare function
def search(sheet2 , s):
for row in range(sheet2.nrows):`enter code here`
if s == sheet2.cell(row,0).value:
return (row,0)
return (9,9)
def compare(oldPerPaxSheet,newPerPaxSheet):
rb1 = xlrd.open_workbook(oldPerPaxSheet)
rb2 = xlrd.open_workbook(newPerPaxSheet)
sheet1 = rb1.sheet_by_index(0)
sheet2 = rb2.sheet_by_index(0)
for rownum in range(max(self.sheet1.nrows, self.sheet2.nrows)):
if rownum < sheet1.nrows:
row_rb1 = sheet1.row_values(rownum)
print ("row_rb1 : "), row_rb1
search_str = sheet1.cell(rownum,0).value
r,c = search(sheet2,search_str)
if (c != 9):
row_rb2 = sheet2.row_values(r)
for colnum, (c1, c2) in enumerate(izip_longest(row_rb1, row_rb2)):
if c1 != c2:
print "Row {} Col {} - {} != {}".format(rownum+1, colnum+1, c1, c2)
else:
print ("ROw does not exists in the other sheet")
pass
else:
print ("Row {} missing").format(rownum+1)
Upvotes: 1
Reputation: 7058
You can easily use pandas.read_excel
for this.
I will make 2 DataFrames with Emp_ID
as index
import pandas as pd
sheets = pd.read_excel(excel_filename, sheetname=[old_sheet, new_sheet], index_col=0)
sheet1 = sheets[old_sheet]
sheet2 = sheets[new_sheet]
I added some rows to have clearer differences
sheet1
Name Phone Address
Emp_ID
1 A 123 ABC
2 B 456 CBD
3 C 789 S
5 A 123 ABC
sheet2
Name Phone Address
Emp_ID
1 A 123 ABC
3 C 789 S
4 D 12 A
5 E 123 ABC
calculating the missing Emp_ID
becomes very simple then
missing_in_1 = set(sheet2.index) - set(sheet1.index)
missing_in_2 = set(sheet1.index) - set(sheet2.index)
missing_in_1, missing_in_2
({4}, {2})
so sheet1 has no Emp_ID
4 which is in sheet2, and sheet2 lacks a 2, as expected
Then to look for the differences, we do an inner-join on the 2 sheets
combined = pd.merge(sheet1, sheet2, left_index=True, right_index=True, suffixes=('_1', '_2'))
combined
Name_1 Phone_1 Address_1 Name_2 Phone_2 Address_2
Emp_ID
1 A 123 ABC A 123 ABC
3 C 789 S C 789 S
5 A 123 ABC E 123 ABC
and loop over the columns of sheet1 to look for differences and save these in a dict
differences = {}
for column in sheet1.columns:
diff = combined[column+'_1'] != combined[column+'_2']
if diff.any():
differences[column] = list(combined[diff].index)
differences
{'Name': [5]}
If you want the whole list of differences, you change the last line to differences[column] = combined[diff]
differences
{'Name':
Name_1 Phone_1 Address_1 Name_2 Phone_2 Address_2
Emp_ID
5 A 123 ABC E 123 ABC}
Upvotes: 0