goldfarb33

Reputation: 75

Using difflib.diff_bytes to compare two files in python

Let's say I want to compare file a and file b with the difflib.diff_bytes function. How would I do this?

Thanks

Upvotes: 6

Views: 4257

Answers (1)

fedepad

Reputation: 4609

In what follows I assume you have Python 3.5 or later, since difflib.diff_bytes was added in 3.5.
Let's go through the documentation to understand the function:

difflib.diff_bytes(dfunc, a, b, fromfile=b'', tofile=b'', fromfiledate=b'', tofiledate=b'', n=3, lineterm=b'\n')
Compare a and b (lists of bytes objects) using dfunc; yield a sequence of delta lines (also bytes) in the format returned by dfunc. dfunc must be a callable, typically either unified_diff() or context_diff().

Allows you to compare data with unknown or inconsistent encoding. All inputs except n must be bytes objects, not str. Works by losslessly converting all inputs (except n) to str, and calling dfunc(a, b, fromfile, tofile, fromfiledate, tofiledate, n, lineterm). The output of dfunc is then converted back to bytes, so the delta lines that you receive have the same unknown/inconsistent encodings as a and b.

The first thing to notice is the distinction between bytes objects and str objects. Every input argument except n must be a bytes object.
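As a minimal sketch of that requirement (the sample lines and the names old.txt and new.txt are made up for illustration), bytes inputs give bytes output, while str inputs fail with a TypeError once the result is consumed:

```python
import difflib

# Hypothetical sample data, not taken from the documentation.
old = [b'one\n', b'two\n']
new = [b'one\n', b'three\n']

# Bytes in, bytes out: lines and file names are all bytes objects.
delta = list(difflib.diff_bytes(difflib.unified_diff, old, new,
                                fromfile=b'old.txt', tofile=b'new.txt'))
print(delta)

# diff_bytes is a generator, so the type check only runs when it is consumed.
try:
    list(difflib.diff_bytes(difflib.unified_diff, ['one\n'], ['one\n']))
    raised = False
except TypeError as exc:
    raised = True
    print('TypeError:', exc)
```

If your data starts out as str, encoding it explicitly (for example with str.encode) before the call avoids the error.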

The key is to pass bytes objects to this function, not strings. If you have a string literal, add the b prefix, which produces an instance of the bytes type rather than the str type.
I suggest reading
What does the 'b' character do in front of a string literal?
string_literals
so I will not explain that part further.
Since I found the documentation on difflib.diff_bytes a bit cryptic, I decided to look directly at the code that CPython itself uses to test that function.
This is a good exercise for understanding how to use it.
The code for testing difflib.diff_bytes is located (assuming you're using Python 3.5) in
test_difflib

Let's check one example in that file to understand what happens.

def test_byte_content(self):
    # if we receive byte strings, we return byte strings
    a = [b'hello', b'andr\xe9']     # iso-8859-1 bytes
    b = [b'hello', b'andr\xc3\xa9'] # utf-8 bytes

    unified = difflib.unified_diff
    context = difflib.context_diff

    check = self.check
    check(difflib.diff_bytes(unified, a, a))
    check(difflib.diff_bytes(unified, a, b))

    # now with filenames (content and filenames are all bytes!)
    check(difflib.diff_bytes(unified, a, a, b'a', b'a'))
    check(difflib.diff_bytes(unified, a, b, b'a', b'b'))

    # and with filenames and dates
    check(difflib.diff_bytes(unified, a, a, b'a', b'a', b'2005', b'2013'))
    check(difflib.diff_bytes(unified, a, b, b'a', b'b', b'2005', b'2013'))

    # same all over again, with context diff
    check(difflib.diff_bytes(context, a, a))
    check(difflib.diff_bytes(context, a, b))
    check(difflib.diff_bytes(context, a, a, b'a', b'a'))
    check(difflib.diff_bytes(context, a, b, b'a', b'b'))
    check(difflib.diff_bytes(context, a, a, b'a', b'a', b'2005', b'2013'))
    check(difflib.diff_bytes(context, a, b, b'a', b'b', b'2005', b'2013'))

As you can see, a and b are lists of bytes objects holding each file's lines. The test then binds two names, unified and context, which are passed as the dfunc argument. Notice the b prefix on every literal, including the filenames and dates. difflib.diff_bytes returns the delta lines as bytes objects, so you have to write your own function to check them.
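To see what those delta lines look like, here is a small sketch that reuses a and b from the test above outside of unittest. The names same and changed are mine, and lineterm=b'' is my choice because the input lines carry no trailing newlines:

```python
import difflib

a = [b'hello', b'andr\xe9']      # iso-8859-1 bytes
b = [b'hello', b'andr\xc3\xa9']  # utf-8 bytes

# lineterm=b'' keeps the header lines free of trailing newlines,
# matching the input lines.
same = list(difflib.diff_bytes(difflib.unified_diff, a, a, lineterm=b''))
changed = list(difflib.diff_bytes(difflib.unified_diff, a, b, lineterm=b''))

print(same)  # identical inputs produce no delta lines at all
for line in changed:
    print(line)  # each delta line is a bytes object
```

The last two lines of changed carry the differing bytes unchanged, which is the point of diff_bytes: nothing has to be decoded with a guessed encoding.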
One example of such a check is in another test in that file, which also includes the filenames in the diff:

def test_byte_filenames(self):
    # somebody renamed a file from ISO-8859-2 to UTF-8
    fna = b'\xb3odz.txt'    # "łodz.txt"
    fnb = b'\xc5\x82odz.txt'

    # they transcoded the content at the same time
    a = [b'\xa3odz is a city in Poland.']
    b = [b'\xc5\x81odz is a city in Poland.']

    check = self.check
    unified = difflib.unified_diff
    context = difflib.context_diff
    check(difflib.diff_bytes(unified, a, b, fna, fnb))
    check(difflib.diff_bytes(context, a, b, fna, fnb))

    def assertDiff(expect, actual):
        # do not compare expect and equal as lists, because unittest
        # uses difflib to report difference between lists
        actual = list(actual)
        self.assertEqual(len(expect), len(actual))
        for e, a in zip(expect, actual):
            self.assertEqual(e, a)

    expect = [
        b'--- \xb3odz.txt',
        b'+++ \xc5\x82odz.txt',
        b'@@ -1 +1 @@',
        b'-\xa3odz is a city in Poland.',
        b'+\xc5\x81odz is a city in Poland.',
    ]
    actual = difflib.diff_bytes(unified, a, b, fna, fnb, lineterm=b'')
    assertDiff(expect, actual)

As you can see, the filenames are now included in the delta lines, also as bytes objects.
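Putting this together for two files on disk, one possible sketch looks like the following. The helper diff_files and the demo files a.txt and b.txt are my own illustration, not part of difflib; the idea is just to open both files in binary mode so that readlines() returns bytes:

```python
import difflib
import os
import tempfile


def diff_files(path_a, path_b):
    """Return the unified diff of two files as a list of bytes lines."""
    # Binary mode keeps the raw bytes, whatever their encoding.
    with open(path_a, 'rb') as fa, open(path_b, 'rb') as fb:
        lines_a = fa.readlines()
        lines_b = fb.readlines()
    # File names must be bytes too; os.fsencode converts a str path.
    return list(difflib.diff_bytes(
        difflib.unified_diff, lines_a, lines_b,
        fromfile=os.fsencode(path_a), tofile=os.fsencode(path_b)))


# Demo with two throwaway files (hypothetical names a.txt and b.txt).
with tempfile.TemporaryDirectory() as tmp:
    path_a = os.path.join(tmp, 'a.txt')
    path_b = os.path.join(tmp, 'b.txt')
    with open(path_a, 'wb') as f:
        f.write(b'hello\nandr\xe9\n')      # iso-8859-1 content
    with open(path_b, 'wb') as f:
        f.write(b'hello\nandr\xc3\xa9\n')  # utf-8 content
    delta = diff_files(path_a, path_b)

for line in delta:
    print(line)
```

To write the diff out as raw bytes rather than printing reprs, sys.stdout.buffer.writelines(delta) should work, since the lines already end in newlines.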

Upvotes: 3
