Yanick Nedderhoff

Reputation: 1244

Difference between two files without sorting

I have the files file1 and file2, where file2 is a subset of file1. That means, if I iterate over file1, there are some lines that are in file2, and some that aren't, but there is no line in file2 that is not in file1. There may be several lines with the same content in a file. Now I want to get the difference between them, that is, all lines of file1 that aren't in file2.

According to this well-received answer:

diff(1) isn't the answer, comm(1) is.

(For whatever reason)

But as I understand it, comm needs the files to be sorted first. The problem: both files are ordered (not sorted!), and this order needs to be kept. So what I really want is to iterate over file1 and check, for every line, whether it is also in file2. If it is not, write it to file3. If the same content occurs more than once, it should be kept more than once!

Is there any way to do this on the command line?

Upvotes: 4

Views: 3911

Answers (2)

Cyrus

Reputation: 88979

Try this with GNU grep:

grep -vFf file2 file1 > file3

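Update: add -x so that only whole-line matches are excluded (without it, a line from file2 also matches when it is merely a substring of a longer line in file1):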

grep -vxFf file2 file1 > file3
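To see what the extra -x changes, here is a minimal sketch on made-up sample data (the file contents, the scratch directory and the name file3_substring are only assumptions for illustration). It runs both variants so the outputs can be compared:

```shell
# Hypothetical sample data, created in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' 'foo' 'foobar' 'b' 'foo' 'a' 'b' > file1
printf '%s\n' 'foo' 'a' > file2

# -F: treat patterns as fixed strings, -f file2: read the patterns from file2,
# -v: print only the lines that do NOT match.
# Without -x, "foobar" is dropped as well, because it contains "foo".
grep -vFf file2 file1 > file3_substring

# With -x, a pattern has to match the whole line.
grep -vxFf file2 file1 > file3
```

With this input, file3 should contain foobar, b, b in that order: the order of file1 is kept and the repeated b survives. Note that every occurrence of a line listed in file2 is removed, no matter how often it appears in either file.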

Upvotes: 5

Walter A

Reputation: 20032

I think you want to avoid sorting because it would need temp files. Process substitution makes the temp files unnecessary (note that the output is then in sorted order, not in the original order):

diff <(sort file1) <(sort file2)
# or, to print only the lines that are unique to file1
comm -23 <(sort file1) <(sort file2)

Edit: Using https://stackoverflow.com/a/4544925/3220113 I found another alternative (for text files with short lines):

diff -a --suppress-common-lines -y file2 file1 | sed 's/\s*>.//'
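If the line length is a concern, the "iterate over file1 and check each line against file2" idea from the question can also be sketched in plain awk. This is only an illustration on made-up sample data, not something taken from the linked answer; the array name seen and the scratch directory are assumptions:

```shell
# Hypothetical sample data, created in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' 'c' 'a' 'b' 'c' 'a' > file1
printf '%s\n' 'a' > file2

# While reading the first file argument (file2): remember each line as an array key.
# While reading file1: print the lines that were never seen in file2.
# Comparing FILENAME with ARGV[1] (instead of the common NR == FNR test)
# keeps this working when file2 is empty.
awk 'FILENAME == ARGV[1] { seen[$0]; next } !($0 in seen)' file2 file1 > file3
```

Here file3 should end up with c, b, c: the original order is kept and duplicates in file1 are preserved. All of file2 is held in memory, which is usually fine when it is the smaller file.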

Upvotes: 0
