Reputation: 844
I am really a beginner at python, but trying to compare some data that has been extracted from two databases into files. In the script I am using a dictionary for each database content and if I find a difference I add it to the dictionary. They keys being the combination of the first two values (code and subCode) and the value is a list of longCodes associated to that code/SubCode combination. Overall my script works, but wouldn't surprise me if its just horribly constructed and inefficient. Sample data that is processing is like:
0,0,83
0,1,157
1,1,158
1,2,159
1,3,210
2,0,211
2,1,212
2,2,213
2,2,214
2,2,215
The idea is that the data should be in sync, but sometimes it is not and I am trying to detect the differences. In reality when I extract data from the DBs there are over 1 million lines in each file. Performance is not that great it seems (maybe its as good as can be?), takes about 35 minutes to process and give me the results. If there are any suggestions for improving performance I will gladly accept!
import difflib, sys, csv, collections
masterDb = collections.OrderedDict()
slaveDb = collections.OrderedDict()
with open('masterDbCodes.lst','r') as f1, open('slaveDbCodes.lst','r') as f2:
diff = difflib.ndiff(f1.readlines(),f2.readlines())
for line in diff:
if line.startswith('-'):
line = line[2:]
codeSubCode = ",".join(line.split(",", 2)[:2])
longCode = ",".join(line.split(",", 2)[2:]).rstrip()
if not codeSubCode in masterDb:
masterDb[codeSubCode] = [(longCode)]
else:
masterDb[codeSubCode].append(longCode)
elif line.startswith('+'):
line = line[2:]
codeSubCode = ",".join(line.split(",", 2)[:2])
longCode = ",".join(line.split(",", 2)[2:]).rstrip()
if not codeSubCode in slaveDb:
slaveDb[codeSubCode] = [(longCode)]
else:
slaveDb[codeSubCode].append(longCode)
f1.close()
f2.close()
Upvotes: 1
Views: 1175
Reputation: 844
So I ended up using different logic to come up with a much more efficient script. Big thanks to https://stackoverflow.com/users/100297/martijn-pieters for the assist.
#!/usr/bin/python
import csv, sys, collections
masterDb = collections.OrderedDict()
slaveDb = collections.OrderedDict()
outFile = open('results.csv', 'wb')
#First find entries in SLAVE that dont match MASTER
with open('masterDbCodes.lst', 'rb') as master:
reader1 = csv.reader(master)
master_rows = {tuple(r) for r in reader1}
with open('slaveDbCodes.lst', 'rb') as slave:
reader = csv.reader(slave)
for row in reader:
if tuple(row) not in master_rows:
code = row[0]
subCode = row[1]
codeSubCode = ",".join([code, subCode])
longCode = row[2]
if not codeSubCode in slaveDb:
slaveDb[codeSubCode] = [(longCode)]
else:
slaveDb[codeSubCode].append(longCode)
#Now find entries in MASTER that dont match SLAVE
with open('slaveDbCodes.lst', 'rb') as slave:
reader1 = csv.reader(slave)
slave_rows = {tuple(r) for r in reader1}
with open('masterDbCodes.lst', 'rb') as master:
reader = csv.reader(master)
for row in reader:
if tuple(row) not in slave_rows:
code = row[0]
subCode = row[1]
codeSubCode = ",".join([code, subCode])
longCode = row[2]
if not codeSubCode in masterDb:
masterDb[codeSubCode] = [(longCode)]
else:
masterDb[codeSubCode].append(longCode)
This solution can process the data (in fact twice) in about 10 seconds.
Upvotes: 0
Reputation: 325
Try this:
import difflib, sys, csv, collections
masterDb = collections.OrderedDict()
slaveDb = collections.OrderedDict()
with open('masterDbCodes.lst','r') as f1, open('slaveDbCodes.lst','r') as f2:
diff = difflib.ndiff(f1.readlines(),f2.readlines())
for line in diff:
if line.startswith('-'):
line = line[2:]
sp=",".join(line.split(",", 2)[:2])
codeSubCode = sp
longCode = sp.rstrip()
try:
masterDb[codeSubCode].append(longCode)
except:
masterDb[codeSubCode] = [(longCode)]
elif line.startswith('+'):
line = line[2:]
sp=",".join(line.split(",", 2)[:2])
codeSubCode = sp
longCode = sp.rstrip()
try:
slaveDb[codeSubCode].append(longCode)
except:
slaveDb[codeSubCode] = [(longCode)]
f1.close()
f2.close()
Upvotes: 1