Reputation: 75
Say I want to compare file a and file b with the difflib.diff_bytes
function. How would I do this?
Thanks
Upvotes: 6
Views: 4257
Reputation: 4609
In the following I will assume you have Python 3.x (specifically 3.5).
Let's analyze the documentation to try to understand the function:
difflib.diff_bytes(dfunc, a, b, fromfile=b'', tofile=b'', fromfiledate=b'', tofiledate=b'', n=3, lineterm=b'\n')
Compare a and b (lists of bytes objects) using dfunc; yield a sequence of delta lines (also bytes) in the format returned by dfunc. dfunc must be a callable, typically either unified_diff() or context_diff(). Allows you to compare data with unknown or inconsistent encoding. All inputs except n must be bytes objects, not str. Works by losslessly converting all inputs (except n) to str, and calling dfunc(a, b, fromfile, tofile, fromfiledate, tofiledate, n, lineterm). The output of dfunc is then converted back to bytes, so the delta lines that you receive have the same unknown/inconsistent encodings as a and b.
The first thing to notice is the distinction made between bytes objects and str(ing) objects: every input argument except n
must be a bytes object.
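To make the "losslessly converting" part of the documentation less abstract, here is a small sketch of how such a round trip can work. It assumes the ASCII codec with the surrogateescape error handler, which to my understanding is what CPython's difflib uses internally; the variable names are only illustrative.

```python
# Assumption: ASCII + surrogateescape is the conversion used for the round trip.
raw = b'andr\xe9'  # ISO-8859-1 bytes, not valid UTF-8

# Bytes outside ASCII become lone surrogate code points instead of failing.
as_text = raw.decode('ascii', 'surrogateescape')

# Encoding with the same handler restores the original bytes exactly.
round_tripped = as_text.encode('ascii', 'surrogateescape')

print(repr(as_text))
print(round_tripped == raw)
```

Because nothing is lost in either direction, the diff never needs to know the real encoding of the data.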
So the key is to pass bytes objects to this function, not strings. For a string literal, add the b
prefix, which produces an instance of the bytes type and not of the str(ing) type; an existing str has to be encoded first, for example with str.encode().
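As a quick illustration of the bytes versus str point, the sketch below builds bytes both ways and shows what happens when a plain str slips in. The variable names are made up for the example.

```python
import difflib

literal = b'hello'                      # bytes literal, thanks to the b prefix
encoded = 'andr\u00e9'.encode('utf-8')  # an existing str converted to bytes

error = None
try:
    # A plain str among the inputs is rejected. diff_bytes is a generator,
    # so the error only surfaces once it is consumed, hence the list() call.
    list(difflib.diff_bytes(difflib.unified_diff, ['hello'], [literal]))
except TypeError as exc:
    error = exc

print(type(literal).__name__, type(encoded).__name__)
print('TypeError raised:', error is not None)
```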
I suggest you read
What does the 'b' character do in front of a string literal?
string_literals
so I will not explain that part further.
Since I found the documentation on difflib.diff_bytes
to be a bit cryptic, I decided to look directly at the code that CPython itself uses to test that function.
This is a good exercise that helps in understanding how to use this function.
The code for testing difflib.diff_bytes
is located (assuming you're using Python 3.5) in
test_difflib
Let's check one example in that file to understand what happens.
    def test_byte_content(self):
        # if we receive byte strings, we return byte strings
        a = [b'hello', b'andr\xe9']      # iso-8859-1 bytes
        b = [b'hello', b'andr\xc3\xa9']  # utf-8 bytes
        unified = difflib.unified_diff
        context = difflib.context_diff
        check = self.check
        check(difflib.diff_bytes(unified, a, a))
        check(difflib.diff_bytes(unified, a, b))
        # now with filenames (content and filenames are all bytes!)
        check(difflib.diff_bytes(unified, a, a, b'a', b'a'))
        check(difflib.diff_bytes(unified, a, b, b'a', b'b'))
        # and with filenames and dates
        check(difflib.diff_bytes(unified, a, a, b'a', b'a', b'2005', b'2013'))
        check(difflib.diff_bytes(unified, a, b, b'a', b'b', b'2005', b'2013'))
        # same all over again, with context diff
        check(difflib.diff_bytes(context, a, a))
        check(difflib.diff_bytes(context, a, b))
        check(difflib.diff_bytes(context, a, a, b'a', b'a'))
        check(difflib.diff_bytes(context, a, b, b'a', b'b'))
        check(difflib.diff_bytes(context, a, a, b'a', b'a', b'2005', b'2013'))
        check(difflib.diff_bytes(context, a, b, b'a', b'b', b'2005', b'2013'))
As you can see, a and b are lists of bytes objects that hold each file's contents. The test then defines two variables, unified and context, which are the two functions it passes as the dfunc
argument. Notice also the "b" prefix on every literal, including the filenames and dates. difflib.diff_bytes
returns the delta lines as bytes objects, and it is up to you to write your own function to check or consume them (self.check is a helper defined by the test class).
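Outside of a test class the same call can be tried directly. This is a minimal sketch that reuses the data from the test above, with line endings added and with a.txt and b.txt as made-up filenames for the header.

```python
import difflib

# Same content in two different encodings, as in the CPython test.
a = [b'hello\n', b'andr\xe9\n']      # iso-8859-1 bytes
b = [b'hello\n', b'andr\xc3\xa9\n']  # utf-8 bytes

# diff_bytes returns a generator; list() collects the delta lines.
delta = list(difflib.diff_bytes(difflib.unified_diff, a, b, b'a.txt', b'b.txt'))

for line in delta:
    print(line)  # every line is a bytes object
```

Passing difflib.context_diff instead of difflib.unified_diff gives the same comparison in context format.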
Another test in that file shows one way to do this, and it also includes the filenames in the diff:
    def test_byte_filenames(self):
        # somebody renamed a file from ISO-8859-2 to UTF-8
        fna = b'\xb3odz.txt'    # "łodz.txt"
        fnb = b'\xc5\x82odz.txt'
        # they transcoded the content at the same time
        a = [b'\xa3odz is a city in Poland.']
        b = [b'\xc5\x81odz is a city in Poland.']
        check = self.check
        unified = difflib.unified_diff
        context = difflib.context_diff
        check(difflib.diff_bytes(unified, a, b, fna, fnb))
        check(difflib.diff_bytes(context, a, b, fna, fnb))
        def assertDiff(expect, actual):
            # do not compare expect and equal as lists, because unittest
            # uses difflib to report difference between lists
            actual = list(actual)
            self.assertEqual(len(expect), len(actual))
            for e, a in zip(expect, actual):
                self.assertEqual(e, a)
        expect = [
            b'--- \xb3odz.txt',
            b'+++ \xc5\x82odz.txt',
            b'@@ -1 +1 @@',
            b'-\xa3odz is a city in Poland.',
            b'+\xc5\x81odz is a city in Poland.',
        ]
        actual = difflib.diff_bytes(unified, a, b, fna, fnb, lineterm=b'')
        assertDiff(expect, actual)
As you can see, the filenames are now included in the delta lines, also as bytes objects.
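Coming back to the original question about two files, a minimal sketch could look like the following. The helper name diff_files and the temporary demo files are my own assumptions for illustration, not something taken from difflib or its tests; the important parts are opening both files in binary mode and passing the filenames as bytes.

```python
import difflib
import os
import tempfile


def diff_files(path_a, path_b):
    """Hypothetical helper: unified diff of two files as a list of bytes lines."""
    # Binary mode ('rb') makes readlines() return bytes, as diff_bytes requires.
    with open(path_a, 'rb') as fa, open(path_b, 'rb') as fb:
        lines_a = fa.readlines()
        lines_b = fb.readlines()
    return list(difflib.diff_bytes(
        difflib.unified_diff, lines_a, lines_b,
        fromfile=os.fsencode(path_a),  # filenames must be bytes too
        tofile=os.fsencode(path_b)))


# Demo with two throwaway files holding the same text in different encodings.
with tempfile.TemporaryDirectory() as tmp:
    path_a = os.path.join(tmp, 'a.txt')
    path_b = os.path.join(tmp, 'b.txt')
    with open(path_a, 'wb') as f:
        f.write(b'hello\nandr\xe9\n')
    with open(path_b, 'wb') as f:
        f.write(b'hello\nandr\xc3\xa9\n')
    delta = diff_files(path_a, path_b)

for line in delta:
    print(line)
```

To emit the diff as raw bytes rather than their repr, sys.stdout.buffer.writelines(delta) can be used in place of the print loop.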
Upvotes: 3