Reputation: 35720
I'd like to compare two large text files(200M) to get the same lines of them.
How to do that in Python?
Upvotes: 0
Views: 1096
Reputation: 40727
Disclaimer: I really have no idea how efficient this will be for 200Mb but it's worth the try I guess:
I have tried the following for two ~80mb files and the result was around 2.7 seconds in a 3GB Ram intel i3 machine.
f1 = open("one")
f2 = open("two")
print set(f1).intersection(f2)
Upvotes: 1
Reputation: 226256
Here's an example from the docs:
>>> from difflib import context_diff
>>> fromfile = open('before.py')
>>> tofile = open('tofile.py')
>>> for line in context_diff(fromfile, tofile, fromfile='before.py', tofile='after.py'):
print line,
Upvotes: 0
Reputation: 992955
You may be able to use the standard difflib module. The module offers several ways of creating difference deltas from various kinds of input.
Upvotes: 0
Reputation: 24641
since they are just 200M, allocate enough memory, read them, sort the lines in ascending order for each, then iterate through both collections of lines in parallel like in a merge operation and delete those that only occur in one set.
preserve line numbers in the collections and sort them by line number after the above, if you want to output them in original order.
merge operation: keep one index for each collection, if lines at both indexes match, increment both indexes, otherwise delete the smaller line and increment just that index. if either index is past the last line, delete all remaining lines in the other collection.
optimization: use a hash to optimize comparisons a little bit; do the hash in the initial read
Upvotes: 1