Shane

Reputation: 4983

What's the fastest way to find unique lines from huge file A as compared to huge file B using python?

I've got txt file A with 300,000+ lines and txt file B with 600,000+ lines. What I want to do is sift through file A line by line; if a line does not appear in file B, it gets appended to file C.

Well, the problem is that if I program it the way I described above, it literally takes ages to finish the whole job. Is there a better way to do this?
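For reference, here is a minimal sketch of the slow approach described above (file names are assumptions). Because B is held in a list, every membership test rescans it from the start, so the total work grows as len(A) * len(B):

with open("b.txt") as b:
    b_lines = b.readlines()            # a list: "in" is a linear scan

with open("a.txt") as a, open("c.txt", "w") as c:
    for line in a:
        if line not in b_lines:        # O(len(B)) work per line of A
            c.write(line)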

Upvotes: 0

Views: 501

Answers (4)

orlp

Reputation: 117771

This should be pretty fast:

with open("a.txt") as a:
    with open("b.txt") as b:
        with open("c.txt", "w") as c:
            c.write("".join(set(a) - set(b)))

Note that this will disregard any order that was in A or B. If you absolutely need to keep the order from A, you can use this:

with open("a.txt") as a:
    with open("b.txt") as b:
        with open("c.txt", "w") as c:
            b_lines = set(b)
            c.write("".join(line for line in a if line not in b_lines))

Upvotes: 14

Daren Thomas

Reputation: 70344

Read all the lines of file B into a set (file names below are assumed for illustration):

with open("file_b.txt") as file_b:
    blines = set(file_b)               # set membership tests are O(1)

with open("file_a.txt") as file_a, open("file_c.txt", "w") as file_c:
    for line in file_a:
        if line not in blines:
            file_c.write(line)

600k+ is not really that much data...

Upvotes: 0

user1202136

Reputation: 11567

Can you hold B in memory? If so, read file B and create an index with all the lines it contains. Then read A line by line and check for each line whether it appears in your index or not.

with open("B") as f:
    B = set(f)                  # index of B's lines; lookups are O(1)

with open("A") as f:
    for line in f:              # stream A line by line
        if line not in B:
            print(line, end="") # the line already ends with "\n"

Upvotes: 1

Hammerite

Reputation: 22340

I don't know anything about Python, but how about sorting file A? Then you can go through file B line by line and do a binary search, which is more efficient.
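In Python this could look something like the sketch below (file names are assumptions for illustration). It sorts B rather than A, so each line of A can be tested directly and C keeps A's original order:

import bisect

# Sort-and-binary-search sketch: B is sorted once, then every line of A
# is located in O(log m) time instead of rescanning B.
with open("b.txt") as b:
    b_lines = sorted(b)

with open("a.txt") as a, open("c.txt", "w") as c:
    for line in a:
        i = bisect.bisect_left(b_lines, line)      # binary search in sorted B
        if i == len(b_lines) or b_lines[i] != line:
            c.write(line)                           # not found in B

That said, a hash set as in the other answers gives O(1) lookups and is simpler.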

Upvotes: 0
