Reputation: 31
I'm fairly new at programming and I am trying to write a python program that will compare 2 .csv files by specific columns and check for additions, removals, and modifications. The .csv files are both in the following format, contain the same amount of columns, and use BillingNumber as the key:
BillingNumber,CustomerName,IsActive,IsCreditHold,IsPayScan,City,State
"2","CHARLIE RYAN","Yes","No","Yes","Reading","PA"
"3","INSURANCE BILLS","","","","",""
"4","AAA","","","","",""
I need to compare only columns 0, 1, 2, and 4. I have tried many different ways to accomplish this but I haven't had any luck. I understand that I can load them into dictionaries using csv.DictReader
or csv.reader
, but after that I get stuck. I'm not sure exactly where or how to start after loading them into memory.
I tried this previously:
import time
old_lines = set((line.strip() for line in open(r'Old/file1.csv', 'r+')))
file_new = open(r'New/file2.csv', 'r+')
choice = 0
choice = int( input('\nPlease choose your result format.\nEnter 1 for .txt, 2 for .csv or 3 for .json\n') )
time.sleep(1)
print(".")
time.sleep(1)
print("..")
time.sleep(1)
print("...")
time.sleep(1)
print("....")
time.sleep(1)
print('Done! Check "Different" folder for results.\n')
if choice == 1:
file_diff = open(r'Different/diff.txt', 'w')
elif choice == 2:
file_diff = open(r'Different/diff.csv', 'w')
elif choice == 3:
file_diff = open(r'Different/diff.json', "w")
else:
print ("You MUST enter 1, 2 or 3")
exit()
for line in file_new:
if line.strip() not in old_lines:
file_diff.write("** ERROR! Entry "+ line + "** Does not match previous file\n\n")
file_new.close()
file_diff.close()
It doesn't work properly because if there is an additional line, or one is missing, it logs everything after that line as different. Also it compares the whole line which is not what I want to do. This was basically just a starting point and although it kind of worked, it isn't specific enough for what I need. I'm really just looking for a good place to start. Thanks!
Upvotes: 2
Views: 2581
Reputation: 4903
Because the requirements of these things have a tendency to spiral I think it would be worth putting the data in a SQLite database.
As the logic for detecting whether a row is deleted or just new can be tricky to implement.
In the below I've presumed that BillingNumber is the id and not to be change.
import sqlite3
con = sqlite3.connect(":memory:")
cursor = con.cursor()
columns = "BillingNumber,CustomerName,IsActive,IsCreditHold,IsPayScan,City,State"
cursor.execute("CREATE TABLE left (%s);" % columns)
cursor.execute("CREATE TABLE right (%s);" % columns)
placeholders = ",".join("?" * len(columns.split(',')))
import csv
def reader(filename):
for (lineno, line) in enumerate(open(filename)):
if lineno > 0: # skip header
yield line
def load_table(tablebname, filename):
for row in csv.reader(reader(filename)):
cursor.execute("INSERT INTO %s VALUES(%s);" % (tablebname, placeholders), row)
load_table("left", "left.csv")
load_table("right", "right.csv")
if False:
print "LEFT"
for row in cursor.execute("SELECT * from left;"):
print row[0]
print "RIGHT"
for row in cursor.execute("SELECT * from right;"):
print row
def dataset(tablename, columns):
for row in cursor.execute("SELECT * from %s;" % tablename):
yield tuple(row[x] for x in columns)
# To use if raw data required.
#left_dataset = dataset("left", [0,1,2,4])
#right_dataset = dataset("right", [0,1,2,4])
# COMPARE functions.
def different_rows():
q = """SELECT left.*, right.*
FROM left, right
WHERE left.BillingNumber = right.BillingNumber
AND ( left.CustomerName != right.CustomerName OR
left.IsActive != right.IsActive OR
left.IsPayScan != right.IsPayScan )
;
"""
for row in cursor.execute(q):
print "DIFFERENCE", row
def new_rows():
q = """SELECT right.*
FROM right
WHERE right.BillingNumber NOT IN ( SELECT BillingNumber FROM left)
;
"""
for row in cursor.execute(q):
print "NEW", row
different_rows()
new_rows()
The OP would have to write different functions to compare the data but I on the whole using SQL might be easier.
Upvotes: 0
Reputation: 1421
I think you were on the right track using the csv module. Since 'BillingNumber' is a unique key, I would create one dict for the "old" billing file, and another for the "new" billing file:
import csv
def make_billing_dict(csv_dict_reader):
bdict = {}
for entry in csv_dict_reader:
key = entry['BillingNumber']
bdict[key] = entry
return bdict
with open('old.csv') as csv_file:
old = csv.DictReader(csv_file)
old_bills = make_billing_dict(old)
That results in this data structure for old_bills
:
{'2': {'BillingNumber': '2',
'City': 'Reading',
'CustomerName': 'CHARLIE RYAN',
'IsActive': 'Yes',
'IsCreditHold': 'No',
'IsPayScan': 'Yes',
'State': 'PA'},
'3': {'BillingNumber': '3',
'City': '',
'CustomerName': 'INSURANCE BILLS',
'IsActive': '',
'IsCreditHold': '',
'IsPayScan': '',
'State': ''},
'4': {'BillingNumber': '4',
'City': '',
'CustomerName': 'AAA',
'IsActive': '',
'IsCreditHold': '',
'IsPayScan': '',
'State': ''}}
Once you create the same data structure for the "new" billing file, you can easily find the differences:
# Keys that are in old_bills, but not new_bills
print set(old_bills.keys()) - set(new_bills.keys())
# Keys that are in new_bills, but not old_bills
print set(new_bills.keys()) - set(old_bills.keys())
# Compare columns for same billing records
# Will print True or False
print old_bills['2']['CustomerName'] == new_bills['2']['CustomerName']
print old_bills['2']['IsActive'] == new_bills['2']['IsActive']
Obviously, you wouldn't write a separate print statement for each comparison. I'm just demonstrating how to use the data structures to find differences. Next, you should write a function to loop through all possible BillingNumbers and check for differences between old and new...but I'll leave that part for you.
Upvotes: 1
Reputation: 14096
Reading your comment:
It's just something i'm trying to figure out. They put up a job listing for a new tech guy at work and the person they will hire has to solve this problem.
They very well may be looking for some command-line-fu. Something akin to
diff <(awk -F "\"*,\"*" '{print $1,$2,$3,$5}' csv1.csv) <(awk -F "\"*,\"*" '{print $1,$2,$3,$5}' csv2.csv)
A command that will work in bash, using the diff tool, to compare certain columns, selected using awk.
This is clearly not a python based solution. However, this solution does demonstrate the power of simple unix-based tools.
Upvotes: 0
Reputation: 8558
Do you have to write this yourself? If this is a programming exercise, all power to you. Otherwise, look for a tool called "diff" which probably exists in some form you already have access to. It's built into many other tools, such as text editors and like vim, emacs and notepad++, and version control systems like subversion mercurial and git.
I recommend you use the established workhorse instead of reinventing the wheel. git diff
is a beast.
Upvotes: 0