Compare two CSV files and output only rows with the specific columns that are different

Question

I have two CSV files with 6 columns each and both have one common column EmpID (the primary key for comparison). For Example, File1.csv is:

EmpID1,Name1,Email1,City1,Phone1,Hobby1
120034,Tom Hanks,tom.hanks@gmail.com,Mumbai,8888999,Fishing

And File2.csv is

EmpID2,Name2,Email2,City2,Phone2,Hobby2
120034,Tom Hanks,hanks.tom@gmail.com,Mumbai,8888999,Running

The files need to be compared for differences and only rows and columns that are different should be added into a new output file as

EmpID1,Email1,Email2,Hobby1,Hobby2
120034,tom.hanks@gmail.com,hanks.tom@gmail.com,Fishing,Running

Currently I have written the below piece of code in Python. Now I am wondering on how to identify and pick the differences. Any pointers and help will be much appreciated.

import csv
import os
os.getcwd()
os.chdir('filepath')

with open('File1.csv', 'r') as csv1, open('File2.csv', 'r') as csv2:

    file1 = csv1.readlines()`
    file2 = csv2.readlines()`

with open('OutputFile.csv', 'w') as output:

    for line in file1:`
        if line not in file2:
            output.write(line)

output.close()
csv1.close()
csv2.close()

nosklo · Accepted Answer

First read the files to a dict structure, with the 'EMPID' as key pointing to the entire row:

import csv
fieldnames = [] # to store all fieldnames
with open('File1.csv') as f:
    cf = csv.DictReader(f, delimiter=',')
    data1 = {row['EMPID1']: row for row in cf}
    fieldnames.extend(cf.fieldnames)

with open('File2.csv') as f:
    cf = csv.DictReader(f, delimiter=',')
    data2 = {row['EMPID2']: row for row in cf}
    fieldnames.extend(cf.fieldnames)

Then identify all ids that are in both dicts:

ids_to_check = set(data1) & set(data2)

Finally, iterate over the ids and compare the rows themselves

with open('OutputFile.csv', 'w') as f:
    cw = csv.DictWriter(f, fieldnames, delimiter=',')
    cw.writeheader()
    for id in ids_to_check:
        diff = compare_dict(data1[id], data2[id], fieldnames)
        if diff:
            cw.writerow(diff)

Here's the compare_dict function implementation:

def compare_dict(d1, d2, fields_compare):
    fields_compare = set(field.rstrip('12') for field in fields_compare)
    if any(d1[k + '1'] != d2[k + '2'] for k in fields_compare):
        # they differ, return a new dict with all fields
        result = d1.copy()
        result.update(d2)
        return result
    else:
        return {}

Compare two CSV files and output only rows with the specific columns that are different

Answers (1)

Related Questions