user1441004

Reputation: 446

difflib - ignore whitespace diffs w/ ndiff()?

I looked at some answers to similar questions here, but I guess I am still not understanding something about the way difflib.ndiff() works.

I am looking at ndiff in particular because the documentation implies that, by default, the diff would be ignoring whitespace changes.

Here's a simple program where I would expect the lines in the differ (i.e., the generator returned by difflib.ndiff()) to be empty:

import difflib

# a simple set of lines
A_LINES = [
    'Line 1',
    'Line 2',
]

# should be same as A_LINES if whitespace is ignored
B_LINES = [
    '  Line 1',
    '  Line 2',
]

def test_2(a, b):
    # differ = difflib.ndiff(a, b)
    differ = difflib.ndiff(a, b, charjunk=difflib.IS_CHARACTER_JUNK)
    for line in differ:
        print(line)

def main():
    test_2(A_LINES, B_LINES)


if __name__ == '__main__':
    main()

difflib.IS_CHARACTER_JUNK() seems to be just a predicate that returns True for ' ' and '\t', and False otherwise. Whether I pass IS_CHARACTER_JUNK explicitly or omit the charjunk argument and accept the default, I get the same output:

- Line 1
+   Line 1
? ++

- Line 2
+   Line 2
? ++

That's not the output I would expect for a diff that is ignoring whitespace. It seems very unexpected to me, given the documentation for ndiff (see: https://docs.python.org/3/library/difflib.html). Is the documentation off, or strange, or wrong, or am I just not understanding something?

How would I call ndiff() such that there are no lines in the 'differ' generator for this example?

Any help in better understanding how to do "ignore whitespace"-type diffs would be greatly appreciated.

Upvotes: 4

Views: 3456

Answers (2)

Erik

Reputation: 21

From looking at the source code at https://github.com/python/cpython/blob/3.10/Lib/difflib.py, I gather that both linejunk and charjunk are used, but in such a way that charjunk has no effect on which lines are reported as different. It seems more a flaw in the logic than a bug per se.

Short answer: no, it is not possible to call ndiff() in such a way that there are no lines in the differ generator in your example.

The way it works is as follows:

ndiff() delegates to Differ().compare() and passes both linejunk and charjunk to Differ, which stores them.

Differ.compare first determines differences using SequenceMatcher, passing only the linejunk function.

When it emits the diff, it calls _fancy_replace for blocks of replaced lines, which is what produces the '?' guide lines. This uses SequenceMatcher again, but this time passing the charjunk function.

For complete line diffs however it just prints '+' or '-', never calling SequenceMatcher again.

So, for complete line diffs, charjunk is never used.

Note that charjunk cannot make two lines compare equal. The first pass with SequenceMatcher compares whole lines, so a line that differs only in whitespace is already marked as changed before charjunk comes into play. _fancy_replace then only uses charjunk to decide how to align the characters within such a pair of lines, which is where the '?' lines in your output come from.

In short: the first call to SequenceMatcher decides which lines appear in the diff, and the second call, the only one that sees charjunk, merely annotates them.
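
A minimal sketch of that two-pass behaviour, using the lists from the question. It only relies on the public SequenceMatcher and ndiff calls; the variable names are mine, and it is meant as an illustration rather than a full account of what Differ does internally.

```python
import difflib

a = ['Line 1', 'Line 2']
b = ['  Line 1', '  Line 2']

# Line-level pass: each element is a whole line, compared by plain equality,
# so lines that differ only in leading whitespace do not match.
line_opcodes = difflib.SequenceMatcher(None, a, b).get_opcodes()
print(line_opcodes)  # [('replace', 0, 2, 0, 2)]

# charjunk is only consulted afterwards, inside the replaced block.
# On this input, turning it off leaves the reported lines unchanged.
with_junk = list(difflib.ndiff(a, b, charjunk=difflib.IS_CHARACTER_JUNK))
without_junk = list(difflib.ndiff(a, b, charjunk=None))
print(with_junk == without_junk)  # True
```

Every line is still reported with a '-', '+' or '?' prefix in both cases; none comes out as unchanged.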

No, the documentation is not clear about this.

I hope this improves your understanding.

I did not find any other way of doing what you asked for, using difflib, apart from rewriting large portions of it.

Upvotes: 2

Jean-François Fabre

Reputation: 140307

It seems that the IS_CHARACTER_JUNK filter function is called but doesn't have any effect in filtering the junk chars. Looks like a bug to me. Python 3.6 difflib still behaves the same.

I can propose an acceptable workaround for now: strip leading and trailing whitespace from each line, and replace every run of whitespace with a single space. That at least gives usable output (removing all spaces would be impractical).

import difflib
import re

lines1 = ["foo bar ","cat ","nope"]
lines2 = ["foo  bar   ","hello","cat    "]

def prefilter(line):
    # raw string so that \s is not treated as an (invalid) string escape
    return re.sub(r"\s+", " ", line.strip())

for d in difflib.ndiff([prefilter(x) for x in lines1],[prefilter(x) for x in lines2]):
    print(d)

Result (only added/removed lines appear as changes; lines that differ only in whitespace don't):

  foo bar
+ hello
  cat
- nope
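
If you need the original, unfiltered lines in the output, one possible variation is to match on the filtered text but print the originals. This is only a sketch under my own assumptions: normalize and diff_ignoring_whitespace are hypothetical helper names, not part of difflib, and unlike ndiff this emits no '?' guide lines.

```python
import difflib
import re

def normalize(line):
    # Same filtering as above: trim both ends, collapse inner whitespace runs.
    return re.sub(r"\s+", " ", line.strip())

def diff_ignoring_whitespace(a, b):
    """Yield ndiff-style lines, matching on normalized text but showing originals."""
    matcher = difflib.SequenceMatcher(
        None, [normalize(x) for x in a], [normalize(x) for x in b]
    )
    for tag, a_lo, a_hi, b_lo, b_hi in matcher.get_opcodes():
        if tag == "equal":
            # Equal after normalization: show the line from the first input.
            for line in a[a_lo:a_hi]:
                yield "  " + line
            continue
        # 'delete' and 'replace' drop lines from a; 'insert' and 'replace' add from b.
        for line in a[a_lo:a_hi]:
            yield "- " + line
        for line in b[b_lo:b_hi]:
            yield "+ " + line

lines1 = ["foo bar ", "cat ", "nope"]
lines2 = ["foo  bar   ", "hello", "cat    "]

for d in diff_ignoring_whitespace(lines1, lines2):
    print(repr(d))
# '  foo bar '
# '+ hello'
# '  cat '
# '- nope'
```

With the two lists from the question, every yielded line carries the two-space "unchanged" prefix, so filtering those out leaves nothing.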

Upvotes: 2
