Reputation: 1244
I have the files `file1` and `file2`, where `file2` is a subset of `file1`. That means if I iterate over `file1`, there are some lines that are in `file2` and some that aren't, but there is no line in `file2` that is not in `file1`. There may be several lines with the same content in a file. Now I want to get the difference between them, that is, all lines of `file1` that aren't in `file2`.
According to this well-received answer, `diff(1)` isn't the tool for this, `comm(1)` is. (For whatever reason.)
But as I understand it, `comm` requires the files to be sorted first. The problem: both files are ordered (not sorted!), and this order needs to be kept. So what I really want is to iterate over `file1` and check for each line whether it is also in `file2`. If not, write it to `file3`. If the same content occurs more than once, it should be kept more than once!
Is there any way to do this on the command line?
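The iterate-and-check behavior described above can be sketched with standard `awk` (a hedged sketch, not taken from any answer): it counts how often each line occurs in `file2`, then streams `file1` and skips a line only while its counter is still positive, so order and any surplus duplicates in `file1` are preserved.

```shell
# Sketch: count each line of file2, then stream file1 and cancel
# exactly as many occurrences as file2 contains.
awk 'NR==FNR { seen[$0]++; next }   # first file (file2): count lines
     seen[$0]-- > 0 { next }        # still an occurrence to cancel: skip
     1                              # otherwise print the line
' file2 file1 > file3
```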
Upvotes: 4
Views: 3911
Reputation: 88979
Try this with GNU `grep`:

grep -vFf file2 file1 > file3

Here `-f file2` reads the patterns from `file2`, `-F` treats them as fixed strings rather than regular expressions, and `-v` inverts the match, keeping only the lines of `file1` that match no pattern.

Update: adding `-x` restricts matching to whole lines, so a short pattern cannot filter a longer line that merely contains it:

grep -vxFf file2 file1 > file3
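To see what `-x` changes, here is a small demonstration with throwaway files (names and contents made up for illustration):

```shell
printf 'apple\na\nbanana\n' > file1
printf 'an\n' > file2

# Without -x, "an" matches as a substring, so "banana" is filtered out too:
grep -vFf file2 file1     # prints: apple, a
# With -x, only a line that is exactly "an" would be removed:
grep -vxFf file2 file1    # prints: apple, a, banana
```

Note that `grep -v` drops every occurrence of a matched line, so if `file1` contains a line more often than `file2` does, the extra copies are lost as well.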
Upvotes: 5
Reputation: 20032
I think you want to avoid creating sorted temporary files. That is possible with process substitution:
diff <(sort file1) <(sort file2)
# or
comm <(sort file1) <(sort file2)
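If sorted output is acceptable, `comm`'s column-suppression flags narrow this to exactly the wanted set: `-2` hides lines unique to the second file and `-3` hides lines common to both, leaving only lines unique to `file1`. A sketch (note the result comes out sorted, not in the original order):

```shell
comm -23 <(sort file1) <(sort file2) > file3
```

Unlike the `grep` approach, `comm` respects multiplicity: a line occurring three times in `file1` and once in `file2` survives twice.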
Edit: Using https://stackoverflow.com/a/4544925/3220113 I found another alternative (for text files with short lines):
diff -a --suppress-common-lines -y file2 file1 | sed 's/\s*>.//'
Upvotes: 0