Inconsistency in diff results for equivalent changes

Question

Consider the following files and diff results:

a1.txt

a
b
My name is Ian

a2.txt

a
a
b
My name is John

Running diff --side-by-side --suppress-common-lines a1.txt a2.txt produces:

                             >  a
My name is Ian               |  My name is John

This correctly states that a was added in a2.txt and that My name is Ian changed to My name is John.

However, if I remove the b from both files, the produced results are different:

b1.txt

a
My name is Ian

b2.txt

a
a
My name is John

Running diff --side-by-side --suppress-common-lines b1.txt b2.txt produces:

My name is Ian                |  a
                              >  My name is John

This states that the line My name is Ian changed to a and that My name is John was added in b2.txt.

Even though the result of the second comparison is technically valid, the difference between a1.txt and a2.txt is equivalent to that of b1.txt and b2.txt, so why would the result not be equal?

Is there anything I can do such that the second comparison produces the same output as the first?

jub0bs · Accepted Answer

The discrepancy you observe between the two examples is normal; it just conflicts with your expectations of what diff does. The diff utility solves the longest-common-subsequence problem, using lines as units/atoms.

[...] the difference between a1.txt and a2.txt is equivalent to that of b1.txt and b2.txt, so why would the result not be equal?

Here, the longest common subsequences in your two examples are different and, roughly speaking, don't "line up" the same way. In the first example, you have

# a1.txt              # a2.txt                   # line in common?
                      a                          n
a                     a                          y 
b                     b                          y
My name is Ian        My name is John            n

whereas, in the second example, you have

# b1.txt              # b2.txt                   # line in common?
a                     a                          y
My name is Ian        a                          n
                      My name is John            n

Therefore, as far as diff is concerned, the differences between the two pairs of files are not equivalent. diff has no memory that all you did to obtain the b[12].txt files was to remove the b line from each of the a[12].txt files. All it sees is that the longest common subsequence now consists of only the one line that contains a, and it deduces the difference between the two b[12].txt files from that.
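To make the two tables above concrete, here is a minimal Python sketch of a line-based longest-common-subsequence computation. It is a textbook dynamic-programming version written for illustration, not the algorithm diff actually runs internally, and the names lcs, a1, a2, b1 and b2 are my own.

```python
def lcs(xs, ys):
    """Return one longest common subsequence of two lists of lines."""
    n, m = len(xs), len(ys)
    # t[i][j] holds the LCS length of the suffixes xs[i:] and ys[j:].
    t = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if xs[i] == ys[j]:
                t[i][j] = t[i + 1][j + 1] + 1
            else:
                t[i][j] = max(t[i + 1][j], t[i][j + 1])

    # Walk the table from the top, matching equal lines as early as possible.
    out, i, j = [], 0, 0
    while i < n and j < m:
        if xs[i] == ys[j]:
            out.append(xs[i])
            i += 1
            j += 1
        elif t[i + 1][j] >= t[i][j + 1]:
            i += 1
        else:
            j += 1
    return out


a1 = ["a", "b", "My name is Ian"]
a2 = ["a", "a", "b", "My name is John"]
b1 = ["a", "My name is Ian"]
b2 = ["a", "a", "My name is John"]

print(lcs(a1, a2))  # ['a', 'b']
print(lcs(b1, b2))  # ['a']
```

In the first pair the b line anchors the alignment; once it is gone, the only line left in common is a single a.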

Is there anything I can do such that the second comparison produces the same output as the first?

Short of using a different diff algorithm (or implementing your own), I don't think so.
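As a rough idea of what "implementing your own" could look like, here is a small Python sketch of an LCS-based line diff that, whenever several alignments have the same LCS length, postpones the match instead of taking the first one. The function name diff_late_match and the tie-breaking rule are my own assumptions for illustration; I have only checked it against the two examples from the question, so treat it as a starting point rather than a general fix.

```python
def diff_late_match(xs, ys):
    """Line diff via LCS; ties are broken by matching common lines late.

    Returns a list of (tag, line) pairs, where tag is "<" (only in xs),
    ">" (only in ys) or " " (common line).
    """
    n, m = len(xs), len(ys)
    # t[i][j] holds the LCS length of the suffixes xs[i:] and ys[j:].
    t = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if xs[i] == ys[j]:
                t[i][j] = t[i + 1][j + 1] + 1
            else:
                t[i][j] = max(t[i + 1][j], t[i][j + 1])

    ops, i, j = [], 0, 0
    while i < n or j < m:
        if i < n and t[i + 1][j] == t[i][j]:
            # Dropping xs[i] does not shorten the LCS: report a deletion.
            ops.append(("<", xs[i]))
            i += 1
        elif j < m and t[i][j + 1] == t[i][j]:
            # Skipping ys[j] does not shorten the LCS: report an insertion.
            ops.append((">", ys[j]))
            j += 1
        else:
            # Neither line can be skipped, so the two lines must be equal.
            ops.append((" ", xs[i]))
            i += 1
            j += 1
    return ops


b1 = ["a", "My name is Ian"]
b2 = ["a", "a", "My name is John"]
for tag, line in diff_late_match(b1, b2):
    print(tag, line)
```

For b1.txt and b2.txt this reports the extra a as an insertion before the common a, followed by My name is Ian on the left and My name is John on the right, which has the same shape as the result for the a[12].txt pair.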
