forcing ndiff on very dissimilar strings

Question

The ndiff function from difflib provides a nice interface for detecting differences between lines. It does a great job when the lines are close enough:

>>> from difflib import ndiff
>>> print('\n'.join(ndiff(['foo*'], ['foot'])))
- foo*
?    ^

+ foot
?    ^

But when the lines are too dissimilar, the rich reporting is no longer possible:

>>> print('\n'.join(ndiff(['foo'], ['foo*****'])))
- foo
+ foo*****

This is the use case I am hitting, and I am trying to find ways to use ndiff (or the underlying class Differ) to force the reporting even if the strings are too dissimilar.

For the failing example, I would like to have a result like:

>>> print('\n'.join(ndiff(['foo'], ['foo*****'])))
- foo
+ foo*****
?    +++++

Andrea Corbellini · Accepted Answer

The function responsible for printing the context (i.e. the lines starting with ?) is Differ._fancy_replace. It only emits those lines when the best-matching pair of lines has a similarity ratio of at least 75% (see the cutoff variable). Unfortunately, that 75% cutoff is hard-coded and cannot be changed.
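To see why the second example in the question gets no ? lines, you can compute the ratio yourself. This is only a rough sketch: it passes None as the junk filter, whereas ndiff passes its own charjunk filter (which should make no difference for these strings), and similarity is a helper name made up for this illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    # ratio() is 2*M/T, where M is the number of matching characters
    # and T is the combined length of both strings.
    # Assumption: no junk filter (None), unlike ndiff's default charjunk.
    return SequenceMatcher(None, a, b).ratio()


# 3 matching characters out of 4 + 4: exactly at the 75% cutoff.
print(similarity('foo*', 'foot'))

# 3 matching characters out of 3 + 8: well below the cutoff.
print(similarity('foo', 'foo*****'))
```

The first pair sits right at 0.75, which is why it still gets the rich report, while the second pair is around 0.55 and falls back to plain - and + lines.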

What I can suggest is to subclass Differ and provide a version of _fancy_replace that simply ignores the cutoff. Here it is:

from difflib import Differ, SequenceMatcher

class FullContextDiffer(Differ):

    def _fancy_replace(self, a, alo, ahi, b, blo, bhi):
        """
        Copied and adapted from https://github.com/python/cpython/blob/3.6/Lib/difflib.py#L928
        """
        best_ratio = 0
        cruncher = SequenceMatcher(self.charjunk)

        for j in range(blo, bhi):
            bj = b[j]
            cruncher.set_seq2(bj)
            for i in range(alo, ahi):
                ai = a[i]
                if ai == bj:
                    continue
                cruncher.set_seq1(ai)
                if cruncher.real_quick_ratio() > best_ratio and \
                      cruncher.quick_ratio() > best_ratio and \
                      cruncher.ratio() > best_ratio:
                    best_ratio, best_i, best_j = cruncher.ratio(), i, j

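        if best_ratio == 0:
            # No pair of lines shares any character, so best_i and best_j
            # were never set; fall back to the plain "-" / "+" output.
            yield from self._plain_replace(a, alo, ahi, b, blo, bhi)
            return

        yield from self._fancy_helper(a, alo, best_i, b, blo, best_j)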

        aelt, belt = a[best_i], b[best_j]

        atags = btags = ""
        cruncher.set_seqs(aelt, belt)
        for tag, ai1, ai2, bj1, bj2 in cruncher.get_opcodes():
            la, lb = ai2 - ai1, bj2 - bj1
            if tag == 'replace':
                atags += '^' * la
                btags += '^' * lb
            elif tag == 'delete':
                atags += '-' * la
            elif tag == 'insert':
                btags += '+' * lb
            elif tag == 'equal':
                atags += ' ' * la
                btags += ' ' * lb
            else:
                raise ValueError('unknown tag %r' % (tag,))
        yield from self._qformat(aelt, belt, atags, btags)

        yield from self._fancy_helper(a, best_i+1, ahi, b, best_j+1, bhi)

And here is an example of how it works:

a = [
    'foo',
    'bar',
    'foobar',
]

b = [
    'foo',
    'bar',
    'barfoo',
]

print('\n'.join(FullContextDiffer().compare(a, b)))

# Output:
# 
#   foo
#   bar
# - foobar
# ?    ---
# 
# + barfoo
# ? +++
