AJNEO999

Reputation: 37

python - Issue in processing files with big size

Basically, I want to create a Python script for my daily tasks that compares two files of any size and generates two new files: one with the matching records and one with the non-matching records from both files.

I have written the Python script below and found that it works properly for files with only a few records.

But when I run the same script on files with 200,000 and 500,000 records, the resulting files do not contain valid output.

So, can you check the script below and help identify what is causing the wrong output?

Thanks in advance.

from sys import argv

script, filePathName1, filePathName2  = argv

def FileDifference(filePathName1, filePathName2):
    fileObject1 = open(filePathName1,'r')
    fileObject2 = open(filePathName2,'r')
    newFilePathName1 = filePathName1 + ' - NonMatchingRecords.txt'
    newFilePathName2 = filePathName1 + ' - MatchingRecords.txt'
    newFileObject1 = open(newFilePathName1,'a')
    newFileObject2 = open(newFilePathName2,'a')
    file1 = fileObject1.readlines()
    file2 = fileObject2.readlines()
    Differece = [ diff for diff in file1 if diff not in file2 ]
    for i in range(0,len(Differece)):
        newFileObject1.write(Differece[i])

    Matching = [ match for match in file1 if match in file2 ]
    for j in range(0,len(Matching)):
        newFileObject2.write(Matching[j])
    fileObject1.close()
    fileObject2.close()
    newFileObject1.close()
    newFileObject2.close()

FileDifference(filePathName1, filePathName2)

Edit-1: Please note that the above program executes without any error. It's just that the output is incorrect and the program takes much longer to finish on large files.

Upvotes: 0

Views: 62

Answers (1)

Jean-François Fabre

Reputation: 140168

I'll take a wild guess and assume that "no valid output" means: "runs forever and does nothing useful".

Which would be logical because of your list comprehensions:

    Differece = [ diff for diff in file1 if diff not in file2 ]
    for i in range(0,len(Differece)):
        newFileObject1.write(Differece[i])

    Matching = [ match for match in file1 if match in file2 ]
    for i in range(0,len(Matching)):
        newFileObject2.write(Matching[i])

Each "in" test against a list is an O(n) scan of that list, which is okay on a small number of lines but effectively never ends if, say, len(file1) == 100000 and len(file2) is the same. You then perform up to 100000*100000 = 10**10 comparisons, which takes forever.
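To put a number on that for the file sizes mentioned in the question, here is a rough back-of-the-envelope sketch. The helper name worst_case_comparisons is made up for illustration, and it assumes the worst case where no line of file1 appears in file2, so every "in" test scans the whole list:

```python
def worst_case_comparisons(len_file1, len_file2, passes=2):
    """Upper bound on line comparisons for the list-based approach.

    Hypothetical helper: assumes every membership test scans all of
    file2, and that the script makes `passes` list comprehensions
    over file1 (one for Differece, one for Matching).
    """
    return passes * len_file1 * len_file2


# File sizes taken from the question: 200,000 and 500,000 records.
print(worst_case_comparisons(200_000, 500_000))  # 200000000000
```

That is on the order of 10**11 string comparisons, which explains why the script appears to hang.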

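The fix is simple: create sets and use intersection and difference, which is much faster because set lookups take constant time on average. Note that sets drop duplicate lines and do not preserve the original line order.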

    file1 = set(fileObject1.readlines())
    file2 = set(fileObject2.readlines())
    # lines that are only in file1
    difference = file1 - file2
    for i in difference:
        newFileObject1.write(i)

    # lines that are in both files
    matching = file1 & file2
    for i in matching:
        newFileObject2.write(i)
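If the order of the records or duplicate lines in file1 matter, a variant of the same idea is to turn only file2 into a set and stream file1 through it. The following is a minimal sketch, not a drop-in replacement: the function name file_difference and its return value are my own, and it assumes the output files should be overwritten on each run (mode 'w' instead of the 'a' used in the question) and that file2 fits in memory.

```python
def file_difference(path1, path2):
    """Split the lines of path1 by whether they also occur in path2.

    Hypothetical rewrite of FileDifference from the question.
    Returns the paths of the two files it writes.
    """
    non_matching_path = path1 + ' - NonMatchingRecords.txt'
    matching_path = path1 + ' - MatchingRecords.txt'

    # Load file2 once into a set: membership tests are O(1) on average.
    with open(path2, 'r') as f2:
        lines2 = set(f2)

    # Assumption: start from empty output files on every run ('w').
    with open(path1, 'r') as f1, \
            open(non_matching_path, 'w') as non_matching, \
            open(matching_path, 'w') as matching:
        # Stream file1 line by line, keeping its order and duplicates.
        for line in f1:
            if line in lines2:
                matching.write(line)
            else:
                non_matching.write(line)

    return non_matching_path, matching_path
```

It can be called the same way as the original, for example file_difference(filePathName1, filePathName2). Lines are compared including their trailing newline, so a final line without one will not match the same text followed by a newline in the other file.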

Upvotes: 1
