user1155413

Reputation: 169

Comparing two large files

I need to write a program that writes the difference between two files to a third file. The program has to loop through a 600 MB file with over 13,464,448 lines, check whether a grep against another file returns a match, and then write the result to an output file. I wrote a quick test with about 1,000,000 records and it took over an hour, so I'm guessing this approach could take 9+ hours.

Do you have any recommendations on how to make this faster? Any particular language I should use? I was planning on doing it in bash or Python.

Thanks a lot in advance.

[EDIT 1]: Sorry, when I say difference between two files I did not mean a diff. The result file is in a different format.

The logic is a bit like this:

File A has 297,599 lines; File B has over 13 million lines.

I take the current line being read from File A, grep for it in File B, and if the line is present in File B I write it to the result file. By the way, File A and File B have different formats. The result file will have the format of File A.

[EDIT 2]: I was asked at work to create the solution in bash, ideally, so that we don't have to install Python on all the machines this has to run on.

This is my current implementation:

#!/bin/bash

LAST_TTP=`ls -ltr TTP_*.txt | tail -1 | awk '{ print $9 }'`
LAST_EXP=`ls -ltr *.SSMT | tail -1 | awk '{ print $9 }'`

while read -r line; do
   MATCH="$(grep $line $LAST_EXP)"
   echo "line: $line, match: $MATCH"

   # if not empty
   if [ ! -z "$MATCH" ]
   then
      echo $MATCH >> result
   fi

done < $LAST_TTP

This bash approach is taking over 10 hours to complete. Do you have any suggestions on how to make it more efficient in bash?

Thanks a lot in advance!

Upvotes: 5

Views: 7851

Answers (7)

phihag

Reputation: 287865

You're probably looking in a list instead of a set, leading to O(n²) performance. Try:

with open('b') as b:
  blines = set(b)
with open('a') as a:
  with open('result', 'w') as result:
    for line in a:
      if line not in blines:
        result.write(line)

Assuming uniformly long (and not overly long) lines, the performance of this implementation is O(|A| + |B|) (amortized, since membership tests on a Python set take O(1) on average). The memory demand is O(|B|), but with a factor significantly greater than 1.

If the order of lines in the output does not matter, you can also sort both files and then compare them line by line. This will have a performance on the order of O(|A| log |A| + |B| log |B|). The memory demand will be O(|A| + |B|), or more precisely, |A| + |B|.

Upvotes: 9

lind

Reputation: 2269

You can also use awk:

awk 'NR==FNR { arrA[$0]; next } $0 in arrA' file_A file_B > result

The order of files (... file_A file_B) in the command line is very important.

Upvotes: 0

Dennis Williamson

Reputation: 360103

You can speed up your script slightly by stopping the grep when it finds the first match if that's appropriate for your needs.

If your grep supports it, use grep -m 1.

Your problem is that you're spawning grep almost 300,000 times and it's reading over 13,000,000 lines each time.

Stopping grep at the first match will help a little, but the overhead of all those execs is a big factor as well. An implementation in Python will eliminate this issue.
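If you do keep the per-line grep, a minimal sketch of a tightened loop could look like this (assuming a fixed-string match is acceptable and your grep supports -m; LAST_TTP and LAST_EXP are the variables from your script):

while IFS= read -r line; do
   # -F matches the line as a literal string, -m 1 stops at the first match,
   # -- protects against lines that begin with a dash
   if MATCH=$(grep -F -m 1 -- "$line" "$LAST_EXP"); then
      printf '%s\n' "$MATCH" >> result
   fi
done < "$LAST_TTP"

Quoting "$line" also keeps lines containing spaces from being split into several patterns, which your current loop would do.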

As for choosing the files in your script, please see BashFAQ/003 and Parsing ls.
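For example, a sketch that picks the newest file by modification time with bash's -nt (newer than) test instead of parsing ls (the TTP_*.txt glob is the one from your script):

LAST_TTP=
for f in TTP_*.txt; do
   # -nt is true when the left file is newer than the right one,
   # or when the right operand does not exist (first iteration)
   [[ $f -nt $LAST_TTP ]] && LAST_TTP=$f
done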

Upvotes: 2

Fabian Büttner

Reputation: 45

The join command also does what you want. join requires both files to be sorted up front. The -v option prints each unpairable line.

So you will want something like

join -v 1 sortedfile1 sortedfile2

(you will need to set the join options according to your file format, see manpage of join)

The example below joins the files test1.txt and test2.txt, using the second field of the first file and the first field of the second file for the join, and assumes both files were sorted up front with the sort command. The -v 1 option outputs only the lines of test1.txt that could not be joined.

> cat test1.txt
a 1 2
b 2 3

> cat test2.txt
1 x
4 x

> join -v 1  -1 2  -2 1  test1.txt test2.txt
2 b 3

> join -v 1 -1 2 -2 1 -o 1.1 1.2 1.3 test1.txt test2.txt
b 2 3

sort and join both work fine with large files.

Upvotes: 2

frankc

Reputation: 11473

I would consider the comm command. It should be faster than grep because it works with sorted data, while grep always does a linear search:

comm -2 -3 <(sort file1) <(sort file2)

Upvotes: 1

Chen Levy

Reputation: 16368

Your implementation seems to do:

grep --fixed-strings --file=file_B file_A > result_file

But both @phihag's and @Mark Ransom's answers are better solutions.

Also, if you wish to bring out the heavy guns, your problem is a good candidate for map-reduce frameworks such as Hadoop.

Upvotes: 2

Mark Ransom

Reputation: 308206

Sort each of the input files. Now read one line from each. If one line compares as less than the other, output it as a difference and read the next line from that file. If both lines compare equal, read the next line from both files. If you reach the end of one file, all the remaining lines in the other file are differences.

This is an O(n log n) algorithm as opposed to the O(n^2) algorithm you started with.
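A minimal bash sketch of this merge, assuming both inputs were pre-sorted under the same locale (e.g. LC_ALL=C sort) and that only the lines unique to the first file are wanted; a.sorted and b.sorted are placeholder names:

exec 3< a.sorted 4< b.sorted

IFS= read -r a <&3 && a_ok=1 || a_ok=
IFS= read -r b <&4 && b_ok=1 || b_ok=

while [[ -n $a_ok && -n $b_ok ]]; do
   if [[ $a < $b ]]; then
      # only in the first file: it is a difference
      printf '%s\n' "$a" >> result
      IFS= read -r a <&3 && a_ok=1 || a_ok=
   elif [[ $a > $b ]]; then
      # only in the second file: skip it
      IFS= read -r b <&4 && b_ok=1 || b_ok=
   else
      # present in both files: advance both
      IFS= read -r a <&3 && a_ok=1 || a_ok=
      IFS= read -r b <&4 && b_ok=1 || b_ok=
   fi
done

# whatever is left in the first file after the second is exhausted is a difference
while [[ -n $a_ok ]]; do
   printf '%s\n' "$a" >> result
   IFS= read -r a <&3 && a_ok=1 || a_ok=
done

exec 3<&- 4<&-

In practice the comm and join answers above perform the same merge in C and will be much faster than a pure bash loop over 13 million lines.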

Upvotes: 4
