ssbsts
ssbsts

Reputation: 844

Find differences between two files very slow

I am really a beginner at python, but trying to compare some data that has been extracted from two databases into files. In the script I am using a dictionary for each database content and if I find a difference I add it to the dictionary. They keys being the combination of the first two values (code and subCode) and the value is a list of longCodes associated to that code/SubCode combination. Overall my script works, but wouldn't surprise me if its just horribly constructed and inefficient. Sample data that is processing is like:

0,0,83
0,1,157
1,1,158
1,2,159
1,3,210
2,0,211
2,1,212
2,2,213
2,2,214
2,2,215

The idea is that the data should be in sync, but sometimes it is not and I am trying to detect the differences. In reality when I extract data from the DBs there are over 1 million lines in each file. Performance is not that great it seems (maybe its as good as can be?), takes about 35 minutes to process and give me the results. If there are any suggestions for improving performance I will gladly accept!

import difflib, sys, csv, collections

masterDb = collections.OrderedDict()
slaveDb = collections.OrderedDict()
with open('masterDbCodes.lst','r') as f1, open('slaveDbCodes.lst','r') as f2:
    diff = difflib.ndiff(f1.readlines(),f2.readlines())
    for line in diff:
        if line.startswith('-'):
            line = line[2:]
            codeSubCode = ",".join(line.split(",", 2)[:2])
            longCode = ",".join(line.split(",", 2)[2:]).rstrip()
            if not codeSubCode in masterDb:
                masterDb[codeSubCode] = [(longCode)]
            else:
                masterDb[codeSubCode].append(longCode)
        elif line.startswith('+'):
            line = line[2:]
            codeSubCode = ",".join(line.split(",", 2)[:2])
            longCode = ",".join(line.split(",", 2)[2:]).rstrip()
            if not codeSubCode in slaveDb:
                slaveDb[codeSubCode] = [(longCode)]
            else:
                slaveDb[codeSubCode].append(longCode)

f1.close()
f2.close()

Upvotes: 1

Views: 1175

Answers (2)

ssbsts
ssbsts

Reputation: 844

So I ended up using different logic to come up with a much more efficient script. Big thanks to https://stackoverflow.com/users/100297/martijn-pieters for the assist.

#!/usr/bin/python

import csv, sys, collections

masterDb = collections.OrderedDict()
slaveDb = collections.OrderedDict()
outFile = open('results.csv', 'wb')

#First find entries in SLAVE that dont match MASTER
with open('masterDbCodes.lst', 'rb') as master:
    reader1 = csv.reader(master)
    master_rows = {tuple(r) for r in reader1}

with open('slaveDbCodes.lst', 'rb') as slave:
    reader = csv.reader(slave)

    for row in reader:
        if tuple(row) not in master_rows:
            code = row[0]
            subCode = row[1]
            codeSubCode = ",".join([code, subCode])
            longCode = row[2]
            if not codeSubCode in slaveDb:
                slaveDb[codeSubCode] = [(longCode)]
            else:
                slaveDb[codeSubCode].append(longCode)

#Now find entries in MASTER that dont match SLAVE
with open('slaveDbCodes.lst', 'rb') as slave:
    reader1 = csv.reader(slave)
    slave_rows = {tuple(r) for r in reader1}

with open('masterDbCodes.lst', 'rb') as master:
    reader = csv.reader(master)

    for row in reader:
        if tuple(row) not in slave_rows:
            code = row[0]
            subCode = row[1]
            codeSubCode = ",".join([code, subCode])
            longCode = row[2]
            if not codeSubCode in masterDb:
                masterDb[codeSubCode] = [(longCode)]
            else:
                masterDb[codeSubCode].append(longCode)

This solution can process the data (in fact twice) in about 10 seconds.

Upvotes: 0

Abdelmajid ELHAMDAOUI
Abdelmajid ELHAMDAOUI

Reputation: 325

Try this:

import difflib, sys, csv, collections

masterDb = collections.OrderedDict()
slaveDb = collections.OrderedDict()
with open('masterDbCodes.lst','r') as f1, open('slaveDbCodes.lst','r') as f2:
    diff = difflib.ndiff(f1.readlines(),f2.readlines())
    for line in diff:
        if line.startswith('-'):
            line = line[2:]
            sp=",".join(line.split(",", 2)[:2])
            codeSubCode = sp
            longCode = sp.rstrip()
            try:
                masterDb[codeSubCode].append(longCode)
            except:
                masterDb[codeSubCode] = [(longCode)]
        elif line.startswith('+'):
            line = line[2:]
            sp=",".join(line.split(",", 2)[:2])
            codeSubCode = sp
            longCode = sp.rstrip()               
            try:
                slaveDb[codeSubCode].append(longCode)
            except:
                slaveDb[codeSubCode] = [(longCode)]

f1.close()
f2.close()

Upvotes: 1

Related Questions