Reputation: 2811
I'm using Python 2.7 to compare two text files line by line, ignoring:
1. differences in line-ending style ('\r\n' vs. '\n'),
2. an empty line at the end of one of the files.
Below is the code I have. It works for point 2, but not for point 1. The files I'm comparing can be big, so I'm reading them line by line. Please don't suggest zip or similar libraries.
def compare_files_by_line(fpath1, fpath2):
    # notice the opening mode 'r'
    with open(fpath1, 'r') as file1, open(fpath2, 'r') as file2:
        file1_end = False
        file2_end = False
        found_diff = False
        while not file1_end and not file2_end and not found_diff:
            try:
                # reasons for stripping explained below
                f1_line = next(file1).rstrip('\n')
            except StopIteration:
                f1_line = None
                file1_end = True
            try:
                f2_line = next(file2).rstrip('\n')
            except StopIteration:
                f2_line = None
                file2_end = True
            if f1_line != f2_line:
                if file1_end or file2_end:
                    if not (f1_line == '' or f2_line == ''):
                        found_diff = True
                        break
                else:
                    found_diff = True
                    break
    return not found_diff
You can test this code failing to meet point 1. by feeding it 2 files, one having a line ending with a UNIX newline
abc\n
the other having a line ending with a Windows newline
abc\r\n
I'm stripping the newline characters before each comparison to account for point 2. That way, when two files contain the same series of lines, the code treats them as "not different" even if one file ends with an empty line and the other does not.
According to this answer, opening the files in 'r' mode (instead of 'rb') should take care of the OS-specific line endings and read them all as '\n'. This is not happening.
How can I make this code treat '\r\n' line endings the same as '\n' endings?
I'm using Python 2.7.12 with the Anaconda distribution 4.2.0.
Upvotes: 1
Views: 3709
Reputation: 1090
On the 'r' option of open, the documentation says this:
The default is to use text mode, which may convert '\n' characters
to a platform-specific representation on writing and back on
reading. Thus, when opening a binary file, you should append 'b' to
the mode value to open the file in binary mode, which will improve portability.
So whether the line ending is converted depends on the platform, and you should not rely on it. (For binary files such a conversion can corrupt the data, hence the 'b' option.)
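One alternative, which is not part of the approach described here, is to ask for newline translation explicitly instead of relying on the platform default. The sketch below uses io.open, which is available in Python 2.7 and 3; in text mode with its default newline=None it translates '\r\n' and '\r' to '\n' on reading. The helper name read_lines_universal, the temporary demo file, and the utf-8 encoding are assumptions made for illustration.

```python
import io
import os
import tempfile


def read_lines_universal(path):
    # Hypothetical helper: io.open in text mode with the default
    # newline=None translates '\r\n' and '\r' to '\n' while reading.
    # The utf-8 encoding is an assumption; use the files' real encoding.
    with io.open(path, 'r', encoding='utf-8') as handle:
        return list(handle)  # fine for a small demo; iterate lazily for big files


# Write a small file with mixed line endings as raw bytes.
fd, demo_path = tempfile.mkstemp()
os.close(fd)
with open(demo_path, 'wb') as handle:
    handle.write(b'abc\r\ndef\n')

demo_lines = read_lines_universal(demo_path)  # every line now ends with '\n'
os.remove(demo_path)
```

Note that under Python 2.7 io.open returns unicode strings rather than str, which may matter if the lines are later compared against byte strings.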
We can solve this by changing the rstrip call to f1_line.rstrip('\r\n') (and likewise for f2_line). This way the line endings are removed on every platform.
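A small illustration of why this works: the argument to rstrip is treated as a set of characters to remove from the end, not as a literal suffix, so '\r\n' covers both ending styles. The sample strings below are made up for the demonstration.

```python
# Sample lines: UNIX ending, Windows ending, no ending, trailing space.
samples = ['abc\n', 'abc\r\n', 'abc', 'abc \r\n']

# Stripping only '\n' leaves the '\r' of a Windows ending in place.
stripped_lf_only = [line.rstrip('\n') for line in samples]

# Stripping the character set {'\r', '\n'} removes either ending,
# while other trailing characters (such as the space) are kept.
stripped_both = [line.rstrip('\r\n') for line in samples]
```

Because the argument is a character set, any run of trailing '\r' and '\n' characters is removed, which is harmless here since each line read from a file ends with at most one line terminator.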
I created a simplified version of your program below:
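from itertools import izip_longest

def compare_files(fpath1, fpath2):
    with open(fpath1, 'r') as file1, open(fpath2, 'r') as file2:
        # izip_longest is lazy and pads the shorter file with None.
        # (Plain izip silently drops one line of file1 when file2 ends first.)
        for linef1, linef2 in izip_longest(file1, file2):
            if linef1 is None or linef2 is None:
                # one file has more lines than the other
                return False
            # strip both '\n' and '\r\n' endings before comparing
            if linef1.rstrip('\r\n') != linef2.rstrip('\r\n'):
                return False
    return True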
Upvotes: 3