Reputation: 35760

How to find same lines in two large text files?

I'd like to compare two large text files (200 MB) and find the lines they have in common.
How can I do that in Python?

Upvotes: 0

Answers (4)

Trufa

Reputation: 40747

~~Disclaimer: I really have no idea how efficient this will be for 200Mb but it's worth the try I guess:~~

I tried the following on two ~80 MB files, and it took around 2.7 seconds on an Intel i3 machine with 3 GB of RAM.

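# Open both files and close them automatically when done
with open("one") as f1, open("two") as f2:
    # Load the first file into a set, then keep only lines also found in the second
    common = set(f1).intersection(f2)

print(common)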
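A set does not preserve order, and a final line without a trailing newline will not match the same line followed by one. The following is a minimal sketch of a variation that addresses both, assuming it is acceptable to hold the first file in memory and to report each common line once. The function name `iter_common_lines` is hypothetical, not part of the answer above.

```python
def iter_common_lines(lines_a, lines_b):
    """Yield each line of lines_b that also occurs in lines_a, once, in lines_b order."""
    # Strip the newline so a last line without one still compares equal
    seen_in_a = {line.rstrip("\n") for line in lines_a}
    already_yielded = set()
    for line in lines_b:
        key = line.rstrip("\n")
        if key in seen_in_a and key not in already_yielded:
            already_yielded.add(key)
            yield key

# Possible usage with the file names from the answer:
# with open("one") as f1, open("two") as f2:
#     for line in iter_common_lines(f1, f2):
#         print(line)
```

Only the first file is held in memory; the second is streamed line by line.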

Upvotes: 1

Raymond Hettinger

Reputation: 226734

Here's an example from the docs:

>>> from difflib import context_diff
>>> fromlines = open('before.py').readlines()  # context_diff needs lists of lines, not file objects
>>> tolines = open('after.py').readlines()
>>> for line in context_diff(fromlines, tolines, fromfile='before.py', tofile='after.py'):
...     print(line, end='')

Upvotes: 0

Greg Hewgill

Reputation: 994817

You may be able to use the standard difflib module. The module offers several ways of creating difference deltas from various kinds of input.
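As a rough sketch of what that could look like, `difflib.SequenceMatcher` reports the blocks of lines that two sequences share in the same relative order. Note that this is not the same as a set intersection: a line that appears in both files but at positions that do not align may be left out. It can also be slow on inputs of this size, so treat it as an illustration to try on smaller files first. The function name `common_lines_in_order` is hypothetical.

```python
from difflib import SequenceMatcher

def common_lines_in_order(a_lines, b_lines):
    """Return the lines that SequenceMatcher aligns between the two lists."""
    # autojunk=False turns off the heuristic that treats very frequent lines as junk
    matcher = SequenceMatcher(None, a_lines, b_lines, autojunk=False)
    common = []
    for block in matcher.get_matching_blocks():
        # Each block says: a_lines[block.a : block.a + block.size] matches b_lines
        common.extend(a_lines[block.a:block.a + block.size])
    return common
```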

Upvotes: 0

necromancer

Reputation: 24651

Since the files are only 200 MB, allocate enough memory and read them both in. Sort the lines of each file in ascending order, then iterate through both collections in parallel, as in a merge operation, and delete the lines that occur in only one of them.

If you want to output the lines in their original order, store each line's number alongside it in the collections and sort by line number after the merge.

Merge operation: keep one index for each collection. If the lines at both indexes match, increment both indexes; otherwise, delete the smaller line and increment just that index. If either index runs past the last line, delete all remaining lines in the other collection.
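The sort-and-merge steps described above could be sketched roughly as follows. This is one possible reading, not a tuned implementation: it collects the matching lines into a new list instead of deleting from the inputs, it restores the original order of the first input, and, because both indexes advance on a match, duplicate lines are paired one for one. The function name is hypothetical.

```python
def common_lines_sorted_merge(lines_a, lines_b):
    """Return lines found in both inputs, in the original order of lines_a."""
    # Pair each line of a with its original position, then sort by content
    a = sorted((line, idx) for idx, line in enumerate(lines_a))
    b = sorted(lines_b)

    kept = []  # (original index in lines_a, line)
    i = j = 0
    while i < len(a) and j < len(b):
        line_a = a[i][0]
        if line_a == b[j]:
            kept.append((a[i][1], line_a))  # match: keep it, advance both
            i += 1
            j += 1
        elif line_a < b[j]:
            i += 1  # line occurs only in a: skip it
        else:
            j += 1  # line occurs only in b: skip it
    # Whatever remains in either list has no partner and is dropped

    kept.sort()  # back to the original order of lines_a
    return [line for _, line in kept]
```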

Optimization: use a hash to speed up the comparisons a little; compute the hash during the initial read.
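One way to read the hashing idea, sketched here under the assumption that Python's built-in `hash` is good enough: store each line as a `(hash, line)` tuple while reading, so that sorting and merging usually compare two integers and fall back to the full line only when the hashes are equal. Whether this is actually faster in CPython is something to measure, and the result comes out in hash order rather than file order. The function name is hypothetical.

```python
def common_lines_hashed(lines_a, lines_b):
    """Return lines found in both inputs, using (hash, line) keys for the merge."""
    # Hash each line once, during the initial read, and sort by (hash, line)
    a = sorted((hash(line), line) for line in lines_a)
    b = sorted((hash(line), line) for line in lines_b)

    common = []
    i = j = 0
    while i < len(a) and j < len(b):
        # Tuple comparison checks the hash first and the line only on a tie
        if a[i] == b[j]:
            common.append(a[i][1])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return common
```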

Upvotes: 1
