Reputation: 37
Basically, I want a Python script for my daily tasks that compares two files of any size and generates two new files: one with the matching records and one with the non-matching records.
I have written the Python script below, and it works properly for files with only a few records.
But when I run the same script on files with 200,000 and 500,000 records, the resulting files do not contain valid output.
So, can you check the script below and help identify what is causing the wrong output?
Thanks in advance.
from sys import argv

script, filePathName1, filePathName2 = argv

def FileDifference(filePathName1, filePathName2):
    fileObject1 = open(filePathName1,'r')
    fileObject2 = open(filePathName2,'r')
    newFilePathName1 = filePathName1 + ' - NonMatchingRecords.txt'
    newFilePathName2 = filePathName1 + ' - MatchingRecords.txt'
    newFileObject1 = open(newFilePathName1,'a')
    newFileObject2 = open(newFilePathName2,'a')
    file1 = fileObject1.readlines()
    file2 = fileObject2.readlines()
    Differece = [ diff for diff in file1 if diff not in file2 ]
    for i in range(0,len(Differece)):
        newFileObject1.write(Differece[i])
    Matching = [ match for match in file1 if match in file2 ]
    for j in range(0,len(Matching)):
        newFileObject2.write(Matching[j])
    fileObject1.close()
    fileObject2.close()
    newFileObject1.close()
    newFileObject2.close()

FileDifference(filePathName1, filePathName2)
Edit-1: Please note that the above program runs without any error. It's just that the output is incorrect and the program takes much longer to finish on large files.
Upvotes: 0
Views: 62
Reputation: 140168
I'll take a wild guess and assume that "no valid output" means: "runs forever and does nothing useful".
Which would be logical because of your list comprehensions:
Differece = [ diff for diff in file1 if diff not in file2 ]
for i in range(0,len(Differece)):
    newFileObject1.write(Differece[i])
Matching = [ match for match in file1 if match in file2 ]
for i in range(0,len(Matching)):
    newFileObject2.write(Matching[i])
Each "in" / "not in" test against a list is an O(n) lookup, which is okay on a small number of lines but effectively never ends if, say, len(file1) == 100000 and file2 is the same size. You then perform up to 100000*100000 iterations => 10**10 => forever.
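To make that cost concrete, here is a rough sketch that counts the comparisons a list membership test does. The function name count_list_lookups is made up for illustration, and the inner loop only imitates what "line in file2" does on a list; it is not how the question's script is written.

```python
def count_list_lookups(file1, file2):
    """Count element comparisons done by testing each line of file1
    against the list file2, the way `line in file2` scans a list."""
    comparisons = 0
    for line in file1:
        for other in file2:
            comparisons += 1
            if line == other:
                break  # `in` stops at the first match
    return comparisons

# Two small lists with nothing in common: every lookup scans all of file2.
file1 = ["a%d\n" % i for i in range(1000)]
file2 = ["b%d\n" % i for i in range(1000)]
print(count_list_lookups(file1, file2))  # 1000000
```

With 1,000 lines per file that is already a million comparisons, and the count grows with the product of the two file sizes.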
The fix is simple: create sets and use intersection (&) and difference (-), which is much faster.
# sets give (on average) constant-time membership tests
file1 = set(fileObject1.readlines())
file2 = set(fileObject2.readlines())
# lines only in file1
difference = file1 - file2
for i in difference:
    newFileObject1.write(i)
# lines present in both files
matching = file1 & file2
for i in matching:
    newFileObject2.write(i)
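One thing to keep in mind with sets: they drop duplicate lines and do not keep the original line order. If that matters for your records, a possible variant is to put only the second file in a set and stream the first file through it. This is just a sketch under that assumption; the function name split_records and its parameters are hypothetical, and it opens the output files in 'w' mode so that old results are not appended to on a rerun.

```python
def split_records(path1, path2, matching_path, non_matching_path):
    """Write each line of path1 to matching_path if it also occurs in
    path2, otherwise to non_matching_path. Keeps path1's order and
    duplicates."""
    # Only the second file is held in memory, as a set for fast lookups.
    with open(path2, 'r') as f2:
        lines2 = set(f2)

    with open(path1, 'r') as f1, \
         open(matching_path, 'w') as match_out, \
         open(non_matching_path, 'w') as diff_out:
        for line in f1:
            if line in lines2:
                match_out.write(line)
            else:
                diff_out.write(line)
```

It could be called as split_records(filePathName1, filePathName2, filePathName1 + ' - MatchingRecords.txt', filePathName1 + ' - NonMatchingRecords.txt'). Note that lines are compared including their trailing newline, so a last line without a newline will not match the same text followed by one.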
Upvotes: 1