Nikhil
Nikhil

Reputation: 29

compare 2 excel files by keeping 1 column fixed of 1 sheet and then comparing with another file with same col by using python

We have 2 excel files one having 7.5k records and other having 7k records. We need to compare the data by keeping one specific column as fixed from one sheet to compare with another sheet.

For Example sheet1:

**Emp_ID|   Name|   Phone|  Address**
-------------------------------------
1       |     A |    123 |  ABC
-------------------------------------
2       |     B |    456 |  CBD
-------------------------------------
3       |     C |    789 |  S

For Example sheet2:

**Emp_ID|   Name|   Phone|  Address**
-------------------------------------
1       |     A |    123 |  ABC
-------------------------------------
3       |     C |    789 |  S

Python Comparison should on the basis of Emp_ID and Emp_ID=2 should be output as missing when passing the argument as Emp_ID while executing the python script. I am trying the same by using XLRD module, but its comparing only cell by cell instead of freezing one column and then comparing the row with other excel file.

def compareexcel(oldSheet, newSheet):
        rowb2 = xlrd.open_workbook(oldSheet)
        rowb1 = xlrd.open_workbook(newSheet)
        sheet1 = rowb1.sheet_by_index(0)
        sheet2 = rowb2.sheet_by_index(0)

        for rownum in range(max(sheet1.nrows, sheet2.nrows)):
            if rownum < sheet1.nrows:
                row_rb1 = sheet1.row_values(rownum)
                row_rb2 = sheet2.row_values(rownum)

                for colnum, (c1, c2) in enumerate(izip_longest(row_rb1, row_rb2)):
                    if c1 != c2:                    
                        print "Row {} Col {} - {} != {}".format(rownum+1, colnum+1, c1, c2)

Upvotes: 0

Views: 594

Answers (2)

JyotiGrover
JyotiGrover

Reputation: 93

I have written one function to search for a column value into another sheet and on the basis of that comparison would take place in compare function

def search(sheet2 , s):
    for row in range(sheet2.nrows):`enter code here`
        if s == sheet2.cell(row,0).value:
            return (row,0)
    return (9,9)

def compare(oldPerPaxSheet,newPerPaxSheet):
    rb1 = xlrd.open_workbook(oldPerPaxSheet)
    rb2 = xlrd.open_workbook(newPerPaxSheet)
    sheet1 = rb1.sheet_by_index(0)
    sheet2 = rb2.sheet_by_index(0)

    for rownum in range(max(self.sheet1.nrows, self.sheet2.nrows)):
            if rownum < sheet1.nrows:
                    row_rb1 = sheet1.row_values(rownum)
                    print ("row_rb1 : "), row_rb1

                    search_str = sheet1.cell(rownum,0).value

                    r,c = search(sheet2,search_str)
                    if (c != 9):
                            row_rb2 = sheet2.row_values(r)
                            for colnum, (c1, c2) in enumerate(izip_longest(row_rb1, row_rb2)):
                                    if c1 != c2:                    
                                            print "Row {} Col {} - {} != {}".format(rownum+1, colnum+1, c1, c2)
                    else:
                            print ("ROw does not exists in the other sheet")
                            pass
            else:
                    print ("Row {} missing").format(rownum+1)

Upvotes: 1

Maarten Fabr&#233;
Maarten Fabr&#233;

Reputation: 7058

You can easily use pandas.read_excel for this.

I will make 2 DataFrames with Emp_ID as index

import pandas as pd

sheets = pd.read_excel(excel_filename, sheetname=[old_sheet, new_sheet], index_col=0)
sheet1 = sheets[old_sheet]
sheet2 = sheets[new_sheet]

I added some rows to have clearer differences

sheet1

    Name    Phone   Address
Emp_ID          
1   A   123     ABC
2   B   456     CBD
3   C   789     S
5   A   123     ABC

sheet2

    Name    Phone   Address
Emp_ID          
1   A   123     ABC
3   C   789     S
4   D   12  A
5   E   123     ABC

calculating the missing Emp_ID becomes very simple then

missing_in_1  = set(sheet2.index) - set(sheet1.index) 
missing_in_2  = set(sheet1.index) - set(sheet2.index) 

missing_in_1, missing_in_2

({4}, {2})

so sheet1 has no Emp_ID 4 which is in sheet2, and sheet2 lacks a 2, as expected

Then to look for the differences, we do an inner-join on the 2 sheets

combined = pd.merge(sheet1, sheet2, left_index=True, right_index=True, suffixes=('_1', '_2'))

combined

    Name_1  Phone_1     Address_1   Name_2  Phone_2     Address_2
Emp_ID                      
1   A   123     ABC     A   123     ABC
3   C   789     S   C   789     S
5   A   123     ABC     E   123     ABC

and loop over the columns of sheet1 to look for differences and save these in a dict

differences = {}
for column in sheet1.columns:
    diff = combined[column+'_1'] != combined[column+'_2']
    if diff.any():
        differences[column] = list(combined[diff].index)

differences

{'Name': [5]}

If you want the whole list of differences, you change the last line to differences[column] = combined[diff]

differences

{'Name':        
         Name_1  Phone_1 Address_1 Name_2  Phone_2 Address_2
 Emp_ID                                                    
 5           A      123       ABC      E      123       ABC}

Upvotes: 0

Related Questions