TheFoxx

Reputation: 1693

How to efficiently iterate two files in python?

I have two text files which should have a lot of matching lines, and I want to find out exactly how many lines match between the files. The problem is that both of the files are quite large (one file is about 3 GB and the other is over 16 GB), so obviously reading them into memory with read() or readlines() could be very problematic. Any tips? The code I'm writing is basically just two loops and an if statement to compare them.
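For reference, the rough shape of what I mean is below. This is only a sketch, and the file names are placeholders rather than my real paths.

matches = 0
for line_a in open("small_file.txt"):
    for line_b in open("big_file.txt"):  # rescans the second file for every line of the first
        if line_a == line_b:
            matches += 1
            break
print(matches)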

Upvotes: 4

Views: 204

Answers (3)

TheFoxx

Reputation: 1693

Well thanks all for your input! But what I ended up doing was painfully simple. I was trying things like this, which read in the whole file.

file = open(xxx,"r")
for line in file:
      if.....

What I ended up doing was

for line in open(xxx):
    if.....

The second one takes the file line by line. It's very time consuming, but I've pretty much accepted that there isn't some magical way to do this that will take very little time :(

Upvotes: 0

Srikar Appalaraju

Reputation: 73638

Why not use Unix grep? If you want your solution to be platform independent, this won't work, but on Unix it does. Run this command from your Python script.

grep --fixed-strings --file=file_B file_A > result_file

Also, this problem looks like a good candidate for map-reduce.

UPDATE 0: To elucidate: --fixed-strings means "interpret PATTERN as a list of fixed strings, separated by newlines, any of which is to be matched", and --file= means "obtain patterns from FILE, one per line".

So what we are doing is taking the patterns from file_B and matching them against the content of file_A, and --fixed-strings treats them as literal strings, exactly as they appear in the file. Hope this makes it clearer.

Since you want the count of matching lines, a slight modification of the above grep gives the count:

grep --fixed-strings --file=file_B file_A | wc -l
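One possible way to run that from a Python script might look like this (just a sketch; file_A and file_B are the same placeholder names as in the command above):

import subprocess

# Run the grep command above and count its output lines in Python
# instead of piping through wc -l.
grep = subprocess.Popen(
    ['grep', '--fixed-strings', '--file=file_B', 'file_A'],
    stdout=subprocess.PIPE,
)
count = sum(1 for _ in grep.stdout)
grep.wait()
print(count)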

UPDATE 1: You could do this: first go through each file separately, line by line, without reading the entire file into memory. As you read each line, compute its MD5 hash and write the hash to another file. Once you have done this for both files, you get two new files filled with MD5 hashes. I am hoping these two files are substantially smaller than the originals, since an MD5 hash is 16 bytes regardless of the length of the input string. Now you can probably run grep or other diffing techniques with little or no memory problem.
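A minimal sketch of that per-line hashing step, assuming the same placeholder file names, could be:

import hashlib

def write_line_hashes(src_path, dst_path):
    # Stream the source file line by line and write one MD5 digest per line.
    with open(src_path, 'rb') as src, open(dst_path, 'w') as dst:
        for line in src:
            dst.write(hashlib.md5(line).hexdigest() + '\n')

write_line_hashes('file_A', 'file_A.md5')
write_line_hashes('file_B', 'file_B.md5')
# The two .md5 files (32 hex characters per line) can now be compared
# with grep or any other diffing technique.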

UPDATE 2: (a few days later) Can you do this? Create two tables, table1 and table2, in MySQL, each with only two fields: id and data. Insert the two files into these tables, line by line, then run a query to find the count of duplicates. You have to go through both files; that's a given, and we can't run away from that fact. The optimisation lies in how the duplicates are found, and MySQL is one such option: it takes care of a lot of things you would otherwise have to handle yourself, like RAM usage and index creation.
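To illustrate the same idea without a database server, here is a rough sketch using Python's built-in sqlite3 as a stand-in for MySQL; the table and file names are placeholders:

import sqlite3

conn = sqlite3.connect('lines.db')
conn.execute('CREATE TABLE IF NOT EXISTS table1 (id INTEGER PRIMARY KEY, data TEXT)')
conn.execute('CREATE TABLE IF NOT EXISTS table2 (id INTEGER PRIMARY KEY, data TEXT)')

# Insert each file line by line without loading it into memory.
with open('file_A') as f:
    conn.executemany('INSERT INTO table1 (data) VALUES (?)', ((line,) for line in f))
with open('file_B') as f:
    conn.executemany('INSERT INTO table2 (data) VALUES (?)', ((line,) for line in f))
conn.commit()

# Count lines of file_A that also appear in file_B.
(count,) = conn.execute(
    'SELECT COUNT(*) FROM table1 WHERE data IN (SELECT data FROM table2)'
).fetchone()
print(count)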

Upvotes: 1

John Zwinck

Reputation: 249223

Since the input files are very large, if you care about performance, you should consider simply using grep -f. The -f option reads patterns from a file, so depending on the exact semantics you're after, it may do what you need. You probably want the -x option too, to take only whole-line matches. So the whole thing in Python might look something like this:

import subprocess

# Use patterns from file1 (-f), whole-line matches only (-x), against file2.
child = subprocess.Popen(['grep', '-xf', file1, file2], stdout=subprocess.PIPE)
for line in child.stdout:
    print line

Upvotes: 2
