wong2

Reputation: 35720

How to find same lines in two large text files?

I'd like to compare two large text files (200M) and find the lines they have in common.
How can I do that in Python?

Upvotes: 0

Views: 1096

Answers (4)

Trufa

Reputation: 40727

Disclaimer: I really have no idea how efficient this will be for 200 MB, but it's worth a try, I guess:

I tried the following on two ~80 MB files and it took around 2.7 seconds on an Intel i3 machine with 3 GB of RAM.

# Set intersection keeps only the lines that appear in both files.
with open("one") as f1, open("two") as f2:
    common = set(f1).intersection(f2)

print(common)
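
If memory is tight, a variation on the set intersection above is to load only one file into a set and stream the other. The following is a minimal sketch, not a benchmarked solution; the function name `common_lines` is made up for illustration, and it assumes that a line with and without a trailing newline should count as the same line.

```python
def common_lines(lines_a, lines_b):
    """Yield each line of lines_b that also occurs in lines_a.

    Lines are yielded once each, in the order they appear in lines_b,
    with the trailing newline removed.
    """
    # Only the first input is held in memory; the second is streamed.
    seen = {line.rstrip("\n") for line in lines_a}
    emitted = set()
    for line in lines_b:
        key = line.rstrip("\n")
        if key in seen and key not in emitted:
            emitted.add(key)
            yield key


# Usage with the file names from the answer (file objects iterate by line):
#     with open("one") as f1, open("two") as f2:
#         for line in common_lines(f1, f2):
#             print(line)
```

Unlike a plain set, this keeps the common lines in the order of the second file.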

Upvotes: 1

Raymond Hettinger

Reputation: 226256

Here's an example from the docs:

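>>> import sys
>>> from difflib import context_diff
>>> # context_diff needs sequences of lines, so read each file into a list
>>> fromlines = open('before.py').readlines()
>>> tolines = open('after.py').readlines()
>>> sys.stdout.writelines(context_diff(fromlines, tolines, fromfile='before.py', tofile='after.py'))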

Upvotes: 0

Greg Hewgill

Reputation: 992955

You may be able to use the standard difflib module. The module offers several ways of creating difference deltas from various kinds of input.
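
As one possible way to use difflib for this, the sketch below collects the lines that `difflib.SequenceMatcher` reports as matching blocks. The helper name `matching_lines` is hypothetical. Note two assumptions: this finds lines that match in the same relative order in both files, which is not the same thing as a set intersection, and `SequenceMatcher` can be quadratic in the worst case, so it may be far too slow for files of this size.

```python
from difflib import SequenceMatcher


def matching_lines(lines_a, lines_b):
    """Return the lines difflib considers common to both sequences, in order."""
    a = list(lines_a)
    b = list(lines_b)
    # autojunk=False turns off the heuristic that ignores very frequent lines.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    common = []
    for block in matcher.get_matching_blocks():
        # Each block says a[block.a : block.a + block.size] equals the
        # corresponding slice of b; the final block always has size 0.
        common.extend(a[block.a:block.a + block.size])
    return common
```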

Upvotes: 0

necromancer

Reputation: 24641

Since the files are only 200M, allocate enough memory and read them both in. Sort the lines of each file in ascending order, then iterate through both collections in parallel, as in a merge operation, and delete the lines that occur in only one of them.

If you want to output the lines in their original order, keep each line's number in the collections and sort by line number afterwards.

Merge operation: keep one index for each collection. If the lines at both indexes match, increment both indexes; otherwise delete the smaller line and increment just that index. If either index is past the last line, delete all remaining lines in the other collection.
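
The sort-and-merge steps above might be sketched roughly as follows. The function name `sorted_intersection` is made up for illustration. Instead of deleting unmatched lines, this version collects the matched ones, and it assumes duplicates should be reported once; it also skips the line-number bookkeeping, so the result comes back in sorted order.

```python
def sorted_intersection(lines_a, lines_b):
    """Return the distinct lines present in both inputs, in sorted order."""
    a = sorted(lines_a)
    b = sorted(lines_b)
    i = j = 0  # one index per collection
    common = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            # Match: keep it (once) and advance both indexes.
            if not common or common[-1] != a[i]:
                common.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # a[i] is smaller, so it cannot occur in b; skip it
        else:
            j += 1  # b[j] is smaller, so it cannot occur in a; skip it
    # Once either index runs off the end, nothing remaining can match.
    return common
```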

Optimization: use a hash to speed up the comparisons a little; compute the hash during the initial read.
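
One way to read the hashing suggestion is to compute a fixed-size digest of each line during the initial read and compare digests rather than full lines. The sketch below is an interpretation, not necessarily what the answer had in mind; the names `line_digest` and `common_lines_by_digest` are hypothetical, the choice of `hashlib.blake2b` with a 16-byte digest is an assumption, and two different lines could in principle share a digest. Python sets already hash their elements, so the main gain here is likely memory when lines are long, not speed.

```python
import hashlib


def line_digest(line):
    """Return a 16-byte digest of a line, ignoring its trailing newline."""
    data = line.rstrip("\n").encode("utf-8")
    return hashlib.blake2b(data, digest_size=16).digest()


def common_lines_by_digest(lines_a, lines_b):
    """Return the lines of lines_b whose digest also appears in lines_a.

    Only digests of the first input are kept in memory. Repeated lines
    in lines_b are returned each time they occur.
    """
    digests = {line_digest(line) for line in lines_a}
    return [line.rstrip("\n") for line in lines_b if line_digest(line) in digests]
```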

Upvotes: 1
