Reputation: 111
I have 2 large files (F1 and F2) with 200k+ rows each. Currently I am comparing each record in F1 against F2 to find the records unique to F1, then comparing F2 against F1 to find the records unique to F2.
I am doing this by reading each line of one file in a 'while' loop, then running 'grep' with that line against the other file to see if a match is found.
This process takes about 3 hours to complete when there are no mismatches, and can take 6+ hours when there is a large number of mismatches (files barely matching, i.e. 200k+ mismatches).
Is there any way I can rewrite this script to accomplish the same thing faster?
I have tried rewriting the script to use sed to delete a line from F2 whenever a match is found, so that when comparing F2 to F1 only the values unique to F2 remain; however, calling sed on every iteration over F1's lines does not improve performance much.
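For reference, a minimal sketch of the kind of loop described above (the exact flags and file names are my guess at the setup, not necessarily the actual script):
while IFS= read -r line; do
    grep -qFx "$line" F2 || echo "$line"   # print lines of F1 with no exact match in F2
done < F1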
Example:
F1 contains:
A
B
E
F
F2 contains:
A
Y
B
Z
The output I'm expecting when comparing F1 to F2 is:
E
F
And when comparing F2 to F1:
Y
Z
Upvotes: 2
Views: 1887
Reputation: 5655
grep can do the entirety of what you want in compiled code if you simply treat one or the other of your files as a pattern file.
$ grep -vFx -f F1.txt F2.txt
Y
Z
$ grep -vFx -f F2.txt F1.txt
E
F
Explanation:
-v  print lines not matching those in the "pattern file" specified with -f
-F  interpret patterns as fixed strings and not regexes (gleaned from this question, which I was reading to see if there was a practical limit to this; I am curious whether it will work with large line counts in both files)
-x  match entire lines
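Putting both directions together, a run over the question's files might look like this (the output file names are my own illustration):
$ grep -vFx -f F2.txt F1.txt > only_in_F1.txt
$ grep -vFx -f F1.txt F2.txt > only_in_F2.txt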
Note that grep -v skips a line as soon as it matches any line in the "pattern" file. If the files are highly dissimilar, performance is very slow, because every pattern is checked against every line before that line is finally printed.
Upvotes: 0
Reputation: 39374
You want comm:
$ cat f1
A
B
E
F
$ cat f2
A
Y
B
Z
$ comm <(sort f1) <(sort f2)
		A
		B
E
F
	Y
	Z
Column 1 of comm output contains the lines unique to f1. Column 2 contains the lines unique to f2. Column 3 contains the lines found in both f1 and f2.
The parameters -1, -2, and -3 suppress the corresponding column. For example, if you want only the lines unique to f1, you can filter out the other columns:
$ comm -23 <(sort f1) <(sort f2)
E
F
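Likewise, suppressing columns 1 and 3 leaves only the lines unique to f2 (a straightforward consequence of the column rules above):
$ comm -13 <(sort f1) <(sort f2)
Y
Z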
Note that comm requires sorted input, which I supply in these examples using bash's process substitution syntax (<()). If you're not using bash, pre-sort into a temporary file.
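A portable sketch of that temporary-file variant (the .sorted names are illustrative):
$ sort f1 > f1.sorted
$ sort f2 > f2.sorted
$ comm -23 f1.sorted f2.sorted
E
F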
Upvotes: 3
Reputation: 10149
If the sort order of the output is not important and you are only interested in the (sorted) set of lines that occur exactly once across both files combined, you can do:
sort F1 F2 | uniq -u
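With the question's example files this produces the two unique sets merged together (one caveat I'll add: a line duplicated within a single file would also be dropped by uniq -u):
$ sort F1 F2 | uniq -u
E
F
Y
Z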
Upvotes: 1
Reputation: 256
Have you tried Linux's diff? Some useful options are -i, -w, -u, and -y.
Note, though, that the files would have to be in the same order (you could sort them first, as in the sketch below).
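For instance, sorting on the fly with bash process substitution (the output shown assumes the question's sample files):
$ diff <(sort F1) <(sort F2)
3,4c3,4
< E
< F
---
> Y
> Z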
Upvotes: 1