alfredodeza

Reputation: 5198

forcing ndiff on very dissimilar strings

The ndiff function from difflib provides a nice interface for detecting differences between lines. It does a great job when the lines are close enough:

>>> print('\n'.join(ndiff(['foo*'], ['foot'])))
- foo*
?    ^

+ foot
?    ^

But when the lines are too dissimilar, the rich reporting (the lines starting with ?) is no longer produced:

>>> print('\n'.join(ndiff(['foo'], ['foo*****'])))
- foo
+ foo*****

This is the use case I am hitting, and I am trying to find ways to use ndiff (or the underlying class Differ) to force the reporting even if the strings are too dissimilar.

For the failing example, I would like to have a result like:

>>> print('\n'.join(ndiff(['foo'], ['foo*****'])))
- foo
+ foo*****
?    +++++

Upvotes: 2

Views: 184

Answers (2)

Olivier Melançon

Reputation: 22324

It seems what you want here is to compare the characters of two strings rather than a sequence of lines. You can pass the strings directly instead of wrapping them in lists, so that ndiff treats each character as an element, and you should get behaviour close to what you are looking for.

>>> print ('\n'.join(list(ndiff('foo', 'foo*****'))))
  f
  o
  o
+ *
+ *
+ *
+ *
+ *

Even though the output format is not exactly the one you are looking for, it contains the correct information. A small output adapter can convert it to the desired format.

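def adapter(out):
    """Collapse a character-level ndiff into a text line and a marker line."""
    chars = []
    symbols = []

    for c in out:
        chars.append(c[2])    # the character itself, after the 2-char prefix
        symbols.append(c[0])  # the marker: ' ', '+' or '-'

    return ''.join(chars), ''.join(symbols)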

This can be used like so.

>>> print ('\n'.join(adapter(ndiff('foo', 'foo*****'))))
foo*****
   +++++
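Note that the adapter above merges both strings into a single line, which reads well for pure insertions but gets confusing once characters are deleted or replaced. A rough sketch of an alternative is to build the two guide lines directly from SequenceMatcher.get_opcodes(), with no similarity cutoff involved. The helper name char_guides is made up for this example, and it handles a single pair of strings only:

```python
from difflib import SequenceMatcher

def char_guides(a, b):
    """Return ndiff-style lines for two strings, however dissimilar.

    Hypothetical helper (not part of difflib): it builds the '?' guide
    lines from the opcodes and never applies a similarity cutoff.
    """
    atags, btags = [], []
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, a, b).get_opcodes():
        la, lb = i2 - i1, j2 - j1
        if tag == 'replace':
            atags.append('^' * la)
            btags.append('^' * lb)
        elif tag == 'delete':
            atags.append('-' * la)
        elif tag == 'insert':
            btags.append('+' * lb)
        else:  # 'equal'
            atags.append(' ' * la)
            btags.append(' ' * lb)

    lines = ['- ' + a]
    aguide = ''.join(atags).rstrip()
    if aguide:  # skip the guide line when there is nothing to mark
        lines.append('? ' + aguide)
    lines.append('+ ' + b)
    bguide = ''.join(btags).rstrip()
    if bguide:
        lines.append('? ' + bguide)
    return lines

print('\n'.join(char_guides('foo', 'foo*****')))
# - foo
# + foo*****
# ?    +++++
```

To diff whole files this way you would still need to decide which lines to pair up, which is the part Differ normally does for you.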

Upvotes: 0

Andrea Corbellini

Reputation: 17781

The function responsible for printing the context (i.e. those lines starting with ?) is Differ._fancy_replace. It only emits them when the best-matching pair of lines has a similarity ratio of at least 0.75 (see the cutoff variable). Unfortunately, that cutoff is hard-coded inside the method and cannot be changed without overriding it.
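To see why the two examples from the question land on opposite sides of the cutoff, you can compute the ratio yourself. This is only an illustration: the similarity helper is a made-up name, and it passes None as the junk filter, whereas ndiff passes a character-junk function by default (which makes no difference for these strings, as they contain no whitespace).

```python
from difflib import SequenceMatcher

def similarity(a, b):
    # ratio() is 2 * (number of matching characters) / (total length of both)
    return SequenceMatcher(None, a, b).ratio()

print(similarity('foo*', 'foot'))      # 2 * 3 / 8  = 0.75, just passes the cutoff
print(similarity('foo', 'foo*****'))   # 2 * 3 / 11, about 0.545, below the cutoff
```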

What I can suggest is to subclass Differ and provide a version of _fancy_replace that simply ignores the cutoff. Here it is:

from difflib import Differ, SequenceMatcher

class FullContextDiffer(Differ):

    def _fancy_replace(self, a, alo, ahi, b, blo, bhi):
        """
        Copied and adapted from https://github.com/python/cpython/blob/3.6/Lib/difflib.py#L928

        Note: this overrides a private method, so it may break with other
        Python versions.
        """
        best_ratio = 0
        best_i = best_j = None
        cruncher = SequenceMatcher(self.charjunk)

        # Find the most similar pair of non-identical lines, with no cutoff.
        for j in range(blo, bhi):
            bj = b[j]
            cruncher.set_seq2(bj)
            for i in range(alo, ahi):
                ai = a[i]
                if ai == bj:
                    continue
                cruncher.set_seq1(ai)
                # Cheap upper bounds first, the expensive ratio() last.
                if cruncher.real_quick_ratio() > best_ratio and \
                      cruncher.quick_ratio() > best_ratio and \
                      cruncher.ratio() > best_ratio:
                    best_ratio, best_i, best_j = cruncher.ratio(), i, j

        if best_i is None:
            # No pair shares anything at all: fall back to a plain -/+ dump
            # instead of failing on the unset best_i/best_j.
            yield from self._plain_replace(a, alo, ahi, b, blo, bhi)
            return

        # Diff everything before the synch point.
        yield from self._fancy_helper(a, alo, best_i, b, blo, best_j)

        aelt, belt = a[best_i], b[best_j]

        # Build the '?' guide lines for the best pair.
        atags = btags = ""
        cruncher.set_seqs(aelt, belt)
        for tag, ai1, ai2, bj1, bj2 in cruncher.get_opcodes():
            la, lb = ai2 - ai1, bj2 - bj1
            if tag == 'replace':
                atags += '^' * la
                btags += '^' * lb
            elif tag == 'delete':
                atags += '-' * la
            elif tag == 'insert':
                btags += '+' * lb
            elif tag == 'equal':
                atags += ' ' * la
                btags += ' ' * lb
            else:
                raise ValueError('unknown tag %r' % (tag,))
        yield from self._qformat(aelt, belt, atags, btags)

        # Diff everything after the synch point.
        yield from self._fancy_helper(a, best_i+1, ahi, b, best_j+1, bhi)

And here is an example of how it works:

a = [
    'foo',
    'bar',
    'foobar',
]

b = [
    'foo',
    'bar',
    'barfoo',
]

print('\n'.join(FullContextDiffer().compare(a, b)))

# Output:
# 
#   foo
#   bar
# - foobar
# ?    ---
# 
# + barfoo
# ? +++

Upvotes: 1
