Reputation: 169
I need to write a program that will write to a file the difference between two files. The program has to loop through a 600 MB file with over 13,464,448 lines, check whether a grep returns true on another file, and then write the result to another file. I wrote a quick test with about 1,000,000 records and it took over an hour, so I'm guessing this approach could take 9+ hours.
Do you have any recommendations on how to make this faster? Any particular language I should use? I was planning on doing it in bash or Python.
Thanks a lot in advance.
[EDIT 1]: Sorry, when I say difference between two files I did not mean a diff. The result file is in a different format.
The logic is a bit like this:
File A has 297,599 lines; File B has over 13 million lines.
I pick the current line being read from File A, grep it on File B, and if the line is present in File B I write it to the result file. By the way, File A and File B have different formats. The result file will have the format of File A.
[EDIT 2]: I was asked at work to ideally create a bash solution, so that we don't have to install Python on all the machines this has to run on.
This is my current implementation:
#!/bin/bash
LAST_TTP=`ls -ltr TTP_*.txt | tail -1 | awk '{ print $9 }'`
LAST_EXP=`ls -ltr *.SSMT | tail -1 | awk '{ print $9 }'`
while read -r line; do
    MATCH="$(grep "$line" "$LAST_EXP")"
    echo "line: $line, match: $MATCH"
    # if not empty
    if [ ! -z "$MATCH" ]
    then
        echo "$MATCH" >> result
    fi
done < "$LAST_TTP"
This bash approach is taking over 10 hours to complete. Do you have any suggestions on how to make it more efficient in bash?
Thanks a lot in advance!
Upvotes: 5
Views: 7851
Reputation: 287865
You're probably looking in a list instead of a set, leading to an O(n²) performance. Try:
with open('b') as b:
    blines = set(b)
with open('a') as a:
    with open('result', 'w') as result:
        for line in a:
            if line not in blines:  # use `in` instead to keep the lines present in B, per EDIT 1
                result.write(line)
Assuming uniformly long (and not overly long) lines, the performance of this implementation is in O(|A| + |B|) (amortized, due to Python's set lookups being extremely fast). The memory demand is in O(|B|), but with a factor significantly greater than 1.
If the order of lines in the output does not matter, you can also sort both files and then compare them line-by-line. This will have a performance in the order of O(|A| log |A| + |B| log |B|). The memory demand will be in O(|A| + |B|), or more precisely, |A| + |B|.
Upvotes: 9
Reputation: 2269
You can also use awk:
awk 'NR==FNR { arrA[$0]; next } $0 in arrA' file_A file_B > result
The order of the files (... file_A file_B) on the command line is very important: awk loads the first file into the array (NR==FNR only holds while reading it) and filters the second file against that array, so the output consists of the file_B lines found in file_A.
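If the result should instead keep File A's lines (the format EDIT 1 asks for), the same idiom works with the file order swapped, at the cost of holding the 13-million-line file in memory rather than the small one. A sketch, reusing the hypothetical file names from above:
awk 'NR==FNR { arrB[$0]; next } $0 in arrB' file_B file_A > result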
Upvotes: 0
Reputation: 360103
You can speed up your script slightly by stopping the grep when it finds the first match, if that's appropriate for your needs. If your grep supports it, use grep -m 1.
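In the asker's loop that is a one-line change. A sketch (the -F and -- flags are added here on the assumption that the lines are literal strings rather than regular expressions; drop them if regex matching is intended):
MATCH="$(grep -m 1 -F -- "$line" "$LAST_EXP")"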
Your problem is that you're spawning grep almost 300,000 times, and it's reading over 13,000,000 lines each time. Stopping grep at the first match will help a little, but the overhead of all those execs is a big factor as well. An implementation in Python will eliminate this issue.
As for choosing the files in your script, please see BashFAQ/003 and Parsing ls.
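For reference, the pattern those pages recommend for picking the newest file, sketched here for the TTP files as a hypothetical drop-in replacement for the ls -ltr lines (bash's -nt test compares modification times):
unset -v latest
for f in TTP_*.txt; do
    [[ $f -nt $latest ]] && latest=$f
done
LAST_TTP=$latest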
Upvotes: 2
Reputation: 45
The join command also does what you want. join requires both files to be sorted up front. The option -v FILENUM suppresses the joined output and instead prints a line for each unpairable line from file FILENUM.
So you will want something like
join -v 1 sortedfile1 sortedfile2
(you will need to set the join options according to your file format; see the join man page)
The example below joins the files test1.txt and test2.txt, using the second field of the first file and the first field of the second file as join keys, and assumes the files were sorted up front using the sort command. The -v 1 option outputs only the lines of test1.txt that could not be joined.
> cat test1.txt
a 1 2
b 2 3
> cat test2.txt
1 x
4 x
> join -v 1 -1 2 -2 1 test1.txt test2.txt
2 b 3
> join -v 1 -1 2 -2 1 -o 1.1,1.2,1.3 test1.txt test2.txt
b 2 3
sort and join both work fine with large files.
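For the files in the question, the end-to-end pipeline might look like this (a sketch with hypothetical file names; each file must be sorted on its join field, so a sort -k option may be needed for multi-field formats):
sort file_A > file_A.sorted
sort file_B > file_B.sorted
join -v 1 file_A.sorted file_B.sorted > result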
Upvotes: 2
Reputation: 11473
I would consider the comm command. It should be faster than the grep loop, because comm makes a single pass over sorted data, while the loop re-scans the entire second file for every single line of the first.
comm -2 -3 <(sort file1) <(sort file2)
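Note that per EDIT 1 the asker actually wants the lines that appear in both files; with comm that means suppressing the other two columns instead (same sorted inputs assumed, so this is just a flag change):
comm -1 -2 <(sort file1) <(sort file2)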
Upvotes: 1
Reputation: 16368
Your implementation seems to do:
grep --fixed-strings --file=file_B file_A > result_file
But both @phihag's and @Mark Ransom's answers are better solutions.
Also, if you wish to use the heavy guns, your problem is a good candidate for map-reduce frameworks such as Hadoop.
Upvotes: 2
Reputation: 308206
Sort each of the input files. Now read one line from each. If one line compares less than the other, output it as a difference and read the next line from that file. If both lines compare equal, read the next line from both files. If you read to the end of one file, all the lines in the other file are differences.
This is an O(n log n) algorithm as opposed to the O(n^2) algorithm you started with.
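A bash sketch of that merge walk, for illustration only: in practice comm or join implement exactly this loop in C and will be far faster over 13 million lines. The file names are hypothetical, LC_ALL=C is exported on the assumption that sort's order and bash's < string comparison must agree, and only the A-side differences are emitted, matching the question:
export LC_ALL=C                  # keep sort order and [[ < ]] comparison consistent
sort file_A > A.sorted
sort file_B > B.sorted
exec 3< A.sorted 4< B.sorted     # one read cursor per sorted file
IFS= read -r a <&3 && a_ok=1 || a_ok=0
IFS= read -r b <&4 && b_ok=1 || b_ok=0
while (( a_ok && b_ok )); do
    if [[ $a < $b ]]; then       # only in A: a difference
        echo "$a"
        IFS= read -r a <&3 && a_ok=1 || a_ok=0
    elif [[ $a > $b ]]; then     # only in B: skip it
        IFS= read -r b <&4 && b_ok=1 || b_ok=0
    else                         # in both: advance both files
        IFS= read -r a <&3 && a_ok=1 || a_ok=0
        IFS= read -r b <&4 && b_ok=1 || b_ok=0
    fi
done
while (( a_ok )); do             # B exhausted: the rest of A differs
    echo "$a"
    IFS= read -r a <&3 && a_ok=1 || a_ok=0
done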
Upvotes: 4