user2120375
user2120375

Reputation: 31

Compare 2 .csv files with Python then output results

I'm fairly new at programming and I am trying to write a python program that will compare 2 .csv files by specific columns and check for additions, removals, and modifications. The .csv files are both in the following format, contain the same amount of columns, and use BillingNumber as the key:

BillingNumber,CustomerName,IsActive,IsCreditHold,IsPayScan,City,State
"2","CHARLIE RYAN","Yes","No","Yes","Reading","PA"
"3","INSURANCE BILLS","","","","",""
"4","AAA","","","","",""

I need to compare only columns 0, 1, 2, and 4. I have tried many different ways to accomplish this but I haven't had any luck. I understand that I can load them into dictionaries using csv.DictReader or csv.reader, but after that I get stuck. I'm not sure exactly where or how to start after loading them into memory.

I tried this previously:

import time
old_lines = set((line.strip() for line in open(r'Old/file1.csv', 'r+')))
file_new = open(r'New/file2.csv', 'r+')

choice = 0
choice = int( input('\nPlease choose your result format.\nEnter 1 for .txt, 2 for .csv or 3 for .json\n') )
time.sleep(1)
print(".")
time.sleep(1)
print("..")
time.sleep(1)
print("...")
time.sleep(1)
print("....")
time.sleep(1)
print('Done! Check "Different" folder for results.\n')
if choice == 1:
    file_diff = open(r'Different/diff.txt', 'w')
elif choice == 2:
    file_diff = open(r'Different/diff.csv', 'w')
elif choice == 3:
    file_diff = open(r'Different/diff.json', "w")
else: 
    print ("You MUST enter 1, 2 or 3")
    exit()

for line in file_new:
    if line.strip() not in old_lines:
        file_diff.write("** ERROR! Entry "+ line + "** Does not match previous file\n\n")
file_new.close()
file_diff.close()

It doesn't work properly because if there is an additional line, or one is missing, it logs everything after that line as different. Also it compares the whole line which is not what I want to do. This was basically just a starting point and although it kind of worked, it isn't specific enough for what I need. I'm really just looking for a good place to start. Thanks!

Upvotes: 2

Views: 2581

Answers (4)

sotapme
sotapme

Reputation: 4903

Because the requirements of these things have a tendency to spiral I think it would be worth putting the data in a SQLite database.

As the logic for detecting whether a row is deleted or just new can be tricky to implement.

In the below I've presumed that BillingNumber is the id and not to be change.

import sqlite3
con = sqlite3.connect(":memory:")

cursor = con.cursor()
columns = "BillingNumber,CustomerName,IsActive,IsCreditHold,IsPayScan,City,State"
cursor.execute("CREATE TABLE left  (%s);" % columns)
cursor.execute("CREATE TABLE right (%s);" % columns) 

placeholders = ",".join("?" * len(columns.split(',')))

import csv
def reader(filename):
    for (lineno, line) in enumerate(open(filename)):
        if lineno > 0: # skip header
            yield line

def load_table(tablebname, filename):
    for row in csv.reader(reader(filename)):
        cursor.execute("INSERT INTO %s VALUES(%s);" % (tablebname, placeholders), row)

load_table("left",  "left.csv")
load_table("right", "right.csv")

if False:
    print "LEFT"
    for row in cursor.execute("SELECT * from left;"):
        print row[0]

        print "RIGHT"
        for row in cursor.execute("SELECT * from right;"):
            print row

def dataset(tablename, columns):
    for row in cursor.execute("SELECT * from %s;" % tablename):
        yield tuple(row[x] for x in columns)

# To use if raw data required.       
#left_dataset = dataset("left", [0,1,2,4])
#right_dataset = dataset("right", [0,1,2,4])

# COMPARE functions.
def different_rows():
    q = """SELECT left.*, right.* 
    FROM left, right
    WHERE left.BillingNumber = right.BillingNumber
    AND ( left.CustomerName !=  right.CustomerName OR 
          left.IsActive     !=  right.IsActive OR
          left.IsPayScan    !=  right.IsPayScan )
          ;
    """
    for row in cursor.execute(q):
        print "DIFFERENCE", row

def new_rows():
    q = """SELECT right.* 
    FROM right
    WHERE right.BillingNumber NOT IN ( SELECT BillingNumber FROM left)
          ;
    """
    for row in cursor.execute(q):
        print "NEW", row

different_rows()
new_rows()

The OP would have to write different functions to compare the data but I on the whole using SQL might be easier.

Upvotes: 0

Tom Offermann
Tom Offermann

Reputation: 1421

I think you were on the right track using the csv module. Since 'BillingNumber' is a unique key, I would create one dict for the "old" billing file, and another for the "new" billing file:

import csv

def make_billing_dict(csv_dict_reader):
    bdict = {}
    for entry in csv_dict_reader:
        key = entry['BillingNumber']
        bdict[key] = entry
    return bdict

with open('old.csv') as csv_file:
    old = csv.DictReader(csv_file)
    old_bills = make_billing_dict(old)

That results in this data structure for old_bills:

{'2': {'BillingNumber': '2',
       'City': 'Reading',
       'CustomerName': 'CHARLIE RYAN',
       'IsActive': 'Yes',
       'IsCreditHold': 'No',
       'IsPayScan': 'Yes',
       'State': 'PA'},
 '3': {'BillingNumber': '3',
       'City': '',
       'CustomerName': 'INSURANCE BILLS',
       'IsActive': '',
       'IsCreditHold': '',
       'IsPayScan': '',
       'State': ''},
 '4': {'BillingNumber': '4',
       'City': '',
       'CustomerName': 'AAA',
       'IsActive': '',
       'IsCreditHold': '',
       'IsPayScan': '',
       'State': ''}}

Once you create the same data structure for the "new" billing file, you can easily find the differences:

# Keys that are in old_bills, but not new_bills
print set(old_bills.keys()) - set(new_bills.keys())

# Keys that are in new_bills, but not old_bills
print set(new_bills.keys()) - set(old_bills.keys())

# Compare columns for same billing records
# Will print True or False
print old_bills['2']['CustomerName'] == new_bills['2']['CustomerName']
print old_bills['2']['IsActive'] == new_bills['2']['IsActive']

Obviously, you wouldn't write a separate print statement for each comparison. I'm just demonstrating how to use the data structures to find differences. Next, you should write a function to loop through all possible BillingNumbers and check for differences between old and new...but I'll leave that part for you.

Upvotes: 1

Wilduck
Wilduck

Reputation: 14096

Reading your comment:

It's just something i'm trying to figure out. They put up a job listing for a new tech guy at work and the person they will hire has to solve this problem.

They very well may be looking for some command-line-fu. Something akin to

diff <(awk -F "\"*,\"*" '{print $1,$2,$3,$5}' csv1.csv) <(awk -F "\"*,\"*" '{print $1,$2,$3,$5}' csv2.csv)

A command that will work in bash, using the diff tool, to compare certain columns, selected using awk.

This is clearly not a python based solution. However, this solution does demonstrate the power of simple unix-based tools.

Upvotes: 0

David Marx
David Marx

Reputation: 8558

Do you have to write this yourself? If this is a programming exercise, all power to you. Otherwise, look for a tool called "diff" which probably exists in some form you already have access to. It's built into many other tools, such as text editors and like vim, emacs and notepad++, and version control systems like subversion mercurial and git.

I recommend you use the established workhorse instead of reinventing the wheel. git diff is a beast.

Upvotes: 0

Related Questions