NGuyen
NGuyen

Reputation: 275

Python: The fatest way to compare 2 text file?

I have 2 file: File A contains 200 000 lines; File B contains 4 000 000 lines. So, I want to compare these files and print the lines which aren't in File B.

For example: File A:

   1
   2
   3

File B:

   1
   4
   5
   6
   7

The output:

   2
   3

And the below is my code:

for line in open ( 'C:/A.txt' ):
    if line not in open ( 'C:/B.txt' ):
        print ( line )

This code works but it takes a very long time to complete. So, how to speed up the code process ?

Any help will be extremely appreciated ! :)

Upvotes: 1

Views: 106

Answers (3)

Sreejith Menon
Sreejith Menon

Reputation: 1087

You could use the set difference operation to get all the lines that do not match in these files.

with open('A.txt') as a:
    contentA = set(a)

with open('B.txt') as b:
    contentB = set(b)

print(contentA - contentB)

Edit: The reverse operation, to print contents of the file B which are not in A is now just

print(contentB - contentA)

Upvotes: 1

jsbueno
jsbueno

Reputation: 110726

Create a set just of the hashes of lines in file B - and compare the lines in A with those in this set -

Such a set will take about about one hundred megabytes of memory, therefore should fit in memory in a notebook or workstation:

linesB = {hash(line) for line in open("fileB"))}
for line in open("fileA"):
    if hash(line) not in linesB:
         print (line)

The main speed up here is that unlike searching for a line linearly inside fileB, it is read only once - and each line is made available in a set, which has constant look-up time. Therefore you come down from ~200,000 X 4,000,000 comparisons (O(m X n)) to just ~200.000 Comparisons (O(m X 1)).

That not to mention not needing to move data rom the filsystem into the program memory 200.000 times around.

By keeping only the hash of lines in B you avoid having to keep all the text information of fileB in memory - just 24 bytes for each hash (in a 64bit system) - insteadof the textual information itself (which depends on each's lines lenght) + its hash.

Upvotes: 2

zondo
zondo

Reputation: 20366

A faster way would be to open the file once and use a set:

with open('C:/A.txt') as a:
    with open('C:/B.txt') as b:
        lines = set(b)
    for line in a:
        if line not in lines:
            print(line)

Maybe a better way would be something like this:

with open('C:/A.txt') as a, open('C:/B.txt') as b:
    lines = set()
    for line in a:
        if line not in lines:
            for line_b in b:
                lines.add(line_b)
                if line_b == line:
                    break
            else:
                print(line)

Upvotes: 1

Related Questions