Reputation: 111
I have 2 large files (F1 and F2) with 200k+ rows each. Currently I am comparing each record in F1 against F2 to find the records unique to F1, then comparing F2 against F1 to find the records unique to F2.
I am doing this by reading each line of one file in a 'while' loop, then running 'grep' with that line against the other file to see if a match is found.
This process takes about 3 hours to complete when there are no mismatches, and can take 6+ hours when there is a large number of mismatches (files barely matching, i.e. 200k+ mismatches).
Is there any way I can rewrite this script to accomplish the same thing faster?
I have tried rewriting the script to use sed to delete a line from F2 whenever a match is found, so that when comparing F2 to F1 only the values unique to F2 remain; however, calling sed on every iteration over F1's lines does not improve performance much.
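For reference, a minimal sketch of the kind of loop described above (the exact flags and file names are my guess at the setup, not necessarily the actual script):
while IFS= read -r line; do
    grep -qFx "$line" F2 || echo "$line"   # print lines of F1 with no exact match in F2
done < F1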
Example:
F1 contains:
A
B
E
F
F2 contains:
A
Y
B
Z
The output I'm expecting when comparing F1 to F2 is:
E
F
And when comparing F2 to F1:
Y
Z
Upvotes: 2
Views: 1887
Reputation: 5655
grep can do the entirety of what you want in compiled code if you simply treat one or the other of your files as a pattern file.
$ grep -vFx -f F1.txt F2.txt
Y
Z
$ grep -vFx -f F2.txt F1.txt
E
F
Explanation:
-v  print lines not matching those in the "pattern file" specified with -f
-F  interpret patterns as fixed strings and not regexes (gleaned from this question, which I was reading to see if there was a practical limit to this; I am curious whether it will work with large line counts in both files)
-x  match entire lines
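Putting both directions together, a run over the question's files might look like this (the output file names are my own illustration):
$ grep -vFx -f F2.txt F1.txt > only_in_F1.txt
$ grep -vFx -f F1.txt F2.txt > only_in_F2.txt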
Note that grep -v skips a line as soon as it matches any line in the "pattern" file. If the files are highly dissimilar, performance is very slow, because every pattern is checked against every line before that line is finally printed.
Upvotes: 0
Reputation: 39374
You want comm:
$ cat f1
A
B
E
F
$ cat f2
A
Y
B
Z
$ comm <(sort f1) <(sort f2)
		A
		B
E
F
	Y
	Z
Column 1 of comm output contains the lines unique to f1. Column 2 contains the lines unique to f2. Column 3 contains the lines found in both f1 and f2.
The parameters -1, -2, and -3 suppress the corresponding column. For example, if you want only the lines unique to f1, you can filter out the other columns:
$ comm -23 <(sort f1) <(sort f2)
E
F
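Likewise, suppressing columns 1 and 3 leaves only the lines unique to f2 (a straightforward consequence of the column rules above):
$ comm -13 <(sort f1) <(sort f2)
Y
Z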
Note that comm requires sorted input, which I supply in these examples using bash's process substitution syntax (<()). If you're not using bash, pre-sort into a temporary file.
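A portable sketch of that temporary-file variant (the .sorted names are illustrative):
$ sort f1 > f1.sorted
$ sort f2 > f2.sorted
$ comm -23 f1.sorted f2.sorted
E
F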
Upvotes: 3
Reputation: 10149
If the sort order of the output is not important and you are only interested in the (sorted) set of lines that occur exactly once across both files combined, you can do:
sort F1 F2 | uniq -u
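With the question's example files this produces the two unique sets merged together (one caveat I'll add: a line duplicated within a single file would also be dropped by uniq -u):
$ sort F1 F2 | uniq -u
E
F
Y
Z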
Upvotes: 1
Reputation: 256
Have you tried Linux's diff? Some useful options are -i, -w, -u, and -y.
Note, though, that the files would have to be in the same order (you could sort them first, as in the sketch below).
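For instance, sorting on the fly with bash process substitution (the output shown assumes the question's sample files):
$ diff <(sort F1) <(sort F2)
3,4c3,4
< E
< F
---
> Y
> Z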
Upvotes: 1