Agostino

Reputation: 2811

Compare 2 files line by line ignoring newline differences

I'm using Python 2.7 to compare two text files line by line, ignoring:

  1. different line endings ('\r\n' vs '\n')
  2. number of empty lines at the end of the files

Below is the code I have. It handles point 2 but not point 1. The files I'm comparing can be big, so I'm reading them line by line. Please don't suggest zip or similar libraries.

def compare_files_by_line(fpath1, fpath2):
    # notice the opening mode 'r'
    with open(fpath1, 'r') as file1, open(fpath2, 'r') as file2:
        file1_end = False
        file2_end = False
        found_diff = False
        while not file1_end and not file2_end and not found_diff:
            try:
                # reasons for stripping explained below
                f1_line = next(file1).rstrip('\n')
            except StopIteration:
                f1_line = None
                file1_end = True
            try:
                f2_line = next(file2).rstrip('\n')
            except StopIteration:
                f2_line = None
                file2_end = True

            if f1_line != f2_line:
                if file1_end or file2_end:
                    if not (f1_line == '' or f2_line == ''):
                        found_diff = True
                        break
                else:
                    found_diff = True
                    break

    return not found_diff

You can see this code fail on point 1 by feeding it two files, one having a line ending with a UNIX newline

abc\n

the other having a line ending with a Windows newline

abc\r\n
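The mismatch can be reproduced without any files. This is a minimal sketch, assuming the two lines reach the comparison with their original endings intact (that is, no newline translation happened while reading); the variable names are only illustrative.

```python
# The two lines as they would be read if no newline translation happens.
unix_line = 'abc\n'
windows_line = 'abc\r\n'

# rstrip('\n') removes only trailing '\n' characters, so the '\r' survives.
unix_stripped = unix_line.rstrip('\n')
windows_stripped = windows_line.rstrip('\n')

print(repr(unix_stripped))                # 'abc'
print(repr(windows_stripped))             # 'abc\r'
print(unix_stripped == windows_stripped)  # False
```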

I'm stripping the newline characters before each comparison to account for point 2. That way, when two files contain the same series of lines, the code recognizes them as "not different" even if one file ends with an empty line while the other one does not.

According to this answer, opening the files in 'r' mode (instead of 'rb') should take care of the OS-specific line endings and read them all as '\n'. This is not happening.

How can I make this work to treat line endings '\r\n' just as '\n' endings?
I'm using Python 2.7.12 with the Anaconda distribution 4.2.0.

Upvotes: 1

Views: 3709

Answers (1)

Work of Artiz

Reputation: 1090

On the 'r' option of open, the documentation says this:

The default is to use text mode, which may convert '\n' characters
to a platform-specific representation on writing and back on
reading. Thus, when opening a binary file, you should append 'b' to
the mode value to open the file in binary mode, which will improve portability.

So whether the newline is converted depends on the platform, and you should not rely on it. (In binary files this conversion can corrupt the data, hence the 'b' option.)

We can solve this by changing the rstrip('\n') calls to rstrip('\r\n'). rstrip treats its argument as a set of characters, so any trailing '\r' and '\n' are removed and the line endings are forcibly stripped on all platforms.
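An alternative to stripping by hand is to request newline translation explicitly rather than relying on the platform default. The sketch below assumes io.open is acceptable here (it returns unicode text in Python 2.7 and decodes with the default encoding); read_lines_universal and write_temp are hypothetical helper names, not part of the question's code.

```python
import io
import os
import tempfile

def read_lines_universal(path):
    # Hypothetical helper. newline=None turns on universal newlines mode
    # in io.open, so CRLF, CR and LF endings are all read back as LF.
    with io.open(path, 'r', newline=None) as handle:
        for line in handle:
            yield line

def write_temp(data):
    # Hypothetical helper: write raw bytes so the endings are stored as given.
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, 'wb') as handle:
        handle.write(data)
    return path

unix_path = write_temp(b'abc\n')
windows_path = write_temp(b'abc\r\n')

unix_lines = list(read_lines_universal(unix_path))
windows_lines = list(read_lines_universal(windows_path))

os.remove(unix_path)
os.remove(windows_path)

print(unix_lines == windows_lines)
```

With translation requested this way, the original rstrip('\n') would be enough, at the cost of depending on the io module's text handling.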

I created a simplified version of your program below:

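from itertools import izip_longest

def compare_files(fpath1, fpath2):
    with open(fpath1, 'r') as file1, open(fpath2, 'r') as file2:
        # izip_longest pads the shorter file with None, so a difference in
        # line count is detected (izip would silently drop one extra line)
        for linef1, linef2 in izip_longest(file1, file2):
            if linef1 is None or linef2 is None:
                return False
            # strip both '\r' and '\n' so the line ending style is ignored
            if linef1.rstrip('\r\n') != linef2.rstrip('\r\n'):
                return False
        return True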
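The simplified version above requires both files to have the same number of lines, so it does not cover point 2 of the question (empty lines at the end). One possible variant, sketched here under the assumption that "empty" means a line containing nothing but its line ending, pads the shorter file with empty strings. The function name and the write_temp helper are hypothetical, and the import fallback is only there so the sketch runs on both Python 2.7 and Python 3.

```python
import os
import tempfile

try:
    from itertools import izip_longest as zip_longest  # Python 2
except ImportError:
    from itertools import zip_longest  # Python 3

def compare_files_ignoring_trailing_blanks(fpath1, fpath2):
    # Hypothetical name. Compares lazily, one line at a time.
    with open(fpath1, 'r') as file1, open(fpath2, 'r') as file2:
        # Lines missing from the shorter file are replaced by '', so they
        # only match lines that are empty once the line ending is stripped.
        for line1, line2 in zip_longest(file1, file2, fillvalue=''):
            if line1.rstrip('\r\n') != line2.rstrip('\r\n'):
                return False
    return True

def write_temp(data):
    # Hypothetical helper: write raw bytes so the endings are stored as given.
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, 'wb') as handle:
        handle.write(data)
    return path

unix_path = write_temp(b'abc\ndef\n')
windows_path = write_temp(b'abc\r\ndef\r\n\r\n\r\n')  # extra empty lines at the end
longer_path = write_temp(b'abc\ndef\nghi\n')          # one extra non-empty line

same = compare_files_ignoring_trailing_blanks(unix_path, windows_path)
different = compare_files_ignoring_trailing_blanks(unix_path, longer_path)

for path in (unix_path, windows_path, longer_path):
    os.remove(path)

print(same, different)
```

Because the padding value is an empty string, an extra line counts as a difference only if it has content, which is the behavior the question asks for.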

Upvotes: 3
