Reputation: 4983
I have a txt file A with 300,000+ lines and a txt file B with 600,000+ lines. What I want to do is go through file A line by line; if a line does not appear in file B, append it to file C.
The problem is that if I program it exactly as described above, it literally takes ages to finish. Is there a better way to do this?
Upvotes: 0
Views: 501
Reputation: 117771
This should be pretty fast:
with open("a.txt") as a:
    with open("b.txt") as b:
        with open("c.txt", "w") as c:
            c.write("".join(set(a) - set(b)))
Note that this will disregard any order that was in A or B. If you absolutely need to keep the order from A you can use this:
with open("a.txt") as a:
    with open("b.txt") as b:
        with open("c.txt", "w") as c:
            b_lines = set(b)
            c.write("".join(line for line in a if line not in b_lines))
Upvotes: 14
Reputation: 70344
Read all the lines of file B into a set, then stream file A against it:
with open("b.txt") as file_b:
    blines = set(file_b)
with open("a.txt") as file_a, open("c.txt", "w") as file_c:
    for line in file_a:
        if line not in blines:
            file_c.write(line)
600k+ is not really that much data...
Upvotes: 0
Reputation: 11567
Can you hold B in memory? If so, read file B and build an index (a set) of all the lines it contains. Then read A line by line and check, for each line, whether it appears in your index.
with open("B") as f:
    B = set(f.readlines())
with open("A") as f:
    for line in f:
        if line not in B:
            print(line, end="")  # lines already end with "\n"
Upvotes: 1
Reputation: 22340
Don't know anything about Python, but: how about sorting file B into a particular order? Then you can go through file A line by line and do a binary search in B for each line - more efficient.
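A minimal Python sketch of this sort-plus-binary-search approach, using the standard-library bisect module (the function name and file paths are my own, purely illustrative):

```python
import bisect
import os
import tempfile

def lines_in_a_not_in_b(a_path, b_path, c_path):
    """Write to c_path every line of a_path that is absent from b_path.

    Sorts B's lines once (O(m log m)), then binary-searches each line
    of A against them (O(n log m)), streaming A instead of holding it.
    """
    with open(b_path) as f:
        b_sorted = sorted(f)  # sorted list of B's lines, kept in memory
    with open(a_path) as f, open(c_path, "w") as out:
        for line in f:
            i = bisect.bisect_left(b_sorted, line)
            # line is present in B only if the insertion point holds an equal line
            if i >= len(b_sorted) or b_sorted[i] != line:
                out.write(line)

# Tiny demo on hypothetical throwaway files:
d = tempfile.mkdtemp()
a, b, c = (os.path.join(d, n) for n in ("a.txt", "b.txt", "c.txt"))
with open(a, "w") as f:
    f.write("x\ny\nz\n")
with open(b, "w") as f:
    f.write("y\nq\n")
lines_in_a_not_in_b(a, b, c)
with open(c) as f:
    print(f.read())  # lines of A missing from B: "x" and "z"
```

Note that a set lookup (as in the other answers) is O(1) per line versus O(log m) here, so in Python the set approach is usually both simpler and faster; the sorted approach mainly pays off when B is too large to hash comfortably or is already sorted on disk.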
Upvotes: 0