Reputation: 275
I have 2 file: File A contains 200 000 lines; File B contains 4 000 000 lines. So, I want to compare these files and print the lines which aren't in File B.
For example: File A:
1
2
3
File B:
1
4
5
6
7
The output:
2
3
And the below is my code:
for line in open ( 'C:/A.txt' ):
if line not in open ( 'C:/B.txt' ):
print ( line )
This code works but it takes a very long time to complete. So, how to speed up the code process ?
Any help will be extremely appreciated ! :)
Upvotes: 1
Views: 106
Reputation: 1087
You could use the set difference
operation to get all the lines that do not match in these files.
with open('A.txt') as a:
contentA = set(a)
with open('B.txt') as b:
contentB = set(b)
print(contentA - contentB)
Edit: The reverse operation, to print contents of the file B which are not in A is now just
print(contentB - contentA)
Upvotes: 1
Reputation: 110726
Create a set just of the hashes of lines in file B - and compare the lines in A with those in this set -
Such a set will take about about one hundred megabytes of memory, therefore should fit in memory in a notebook or workstation:
linesB = {hash(line) for line in open("fileB"))}
for line in open("fileA"):
if hash(line) not in linesB:
print (line)
The main speed up here is that unlike searching for a line linearly inside fileB, it is read only once - and each line is made available in a set, which has constant look-up time. Therefore you come down from ~200,000 X 4,000,000 comparisons (O(m X n)) to just ~200.000 Comparisons (O(m X 1)).
That not to mention not needing to move data rom the filsystem into the program memory 200.000 times around.
By keeping only the hash
of lines in B you avoid having to keep all the text information of fileB in memory - just 24 bytes for each hash (in a 64bit system) - insteadof the textual information itself (which depends on each's lines lenght) + its hash.
Upvotes: 2
Reputation: 20366
A faster way would be to open the file once and use a set:
with open('C:/A.txt') as a:
with open('C:/B.txt') as b:
lines = set(b)
for line in a:
if line not in lines:
print(line)
Maybe a better way would be something like this:
with open('C:/A.txt') as a, open('C:/B.txt') as b:
lines = set()
for line in a:
if line not in lines:
for line_b in b:
lines.add(line_b)
if line_b == line:
break
else:
print(line)
Upvotes: 1