Corey Trager

Reputation: 23123

In Python, is there a concise way of comparing whether the contents of two text files are the same?

I don't care what the differences are. I just want to know whether the contents are different.

Upvotes: 83

Views: 47618

Answers (10)

Angel

Reputation: 2865

You should try the standard library's filecmp.cmp function, as @Federico said in the first answer.

If you want more control over the file comparison in order to implement your own business logic, you can use my solution, inspired by the filecmp code:

https://github.com/python/cpython/blob/e1a223431f49926bfaf5dbc58aab2becf3253972/Lib/filecmp.py#L75

Simple and efficient solution:

import os


def is_file_content_equal(
    file_path_1: str, file_path_2: str, buffer_size: int = 1024 * 8
) -> bool:
    """Checks if two files content is equal"""
    # First check sizes
    s1, s2 = os.path.getsize(file_path_1), os.path.getsize(file_path_2)
    if s1 != s2:
        return False
    
    # If the sizes are the same check the content
    with open(file_path_1, "rb") as fp1, open(file_path_2, "rb") as fp2:
        while True:
            b1 = fp1.read(buffer_size)
            b2 = fp2.read(buffer_size)
            if b1 != b2:
                return False
    
            # if the content is the same and they are both empty bytes
            # the file is the same
            if not b1:
                return True

Upvotes: 1

Jake Weilhammer

Reputation: 11

filecmp is great for easy comparison of files, but doesn't allow you to print the line number or difference in the files:

import filecmp

def compare_files(filename1, filename2):
    return filecmp.cmp(filename1, filename2, shallow=False)

Here's a simple and efficient solution that is a bit more flexible in that you can print status of comparison, line numbers, and the line values of where there is a difference in the files:

def compare_with_line_diff(filename1, filename2):
    with open(filename1, "r") as file1, open(filename2, "r") as file2:

        # Loop for all lines in first file (keep only 2 lines in memory)
        for line_num, f1_line in enumerate(file1, start=1):

            # Only print status for range of lines
            if (line_num == 1 or line_num % 1000 == 0):
                print(f"comparing lines {line_num} to {line_num + 1000}")

            # Compare with next line of file2
            f2_line = file2.readline()
            if (f1_line != f2_line):
                print(f"Difference on line: {line_num}")
                print(f"f1_line: '{f1_line}'")
                print(f"f2_line: '{f2_line}'")
                return False

        # Check if file2 has more lines than file1
        for extra_line in file2:
            print(f"Difference on file2: {extra_line}")
            return False

    # Files are equal
    return True
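
If you would rather get the line number back than have it printed, a variant along these lines might work. This is only a sketch: the name first_difference is made up for illustration, and it assumes both files can be opened in text mode with the default encoding.

```python
from itertools import zip_longest


def first_difference(filename1, filename2):
    """Return the 1-based number of the first differing line, or None if equal."""
    with open(filename1, "r") as file1, open(filename2, "r") as file2:
        # zip_longest pads the shorter file with None, so a file that
        # simply ends early is reported as a difference on that line
        pairs = zip_longest(file1, file2)
        for line_num, (line1, line2) in enumerate(pairs, start=1):
            if line1 != line2:
                return line_num
    return None
```

Like the function above, it keeps only one line from each file in memory at a time.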

Upvotes: 1

tzot

Reputation: 95931

This is a functional-style file comparison function. It returns False immediately if the files have different sizes; otherwise, it reads the files in 4 KiB blocks and returns False at the first difference:

from __future__ import with_statement
import os
import itertools, functools, operator
try:
    izip= itertools.izip  # Python 2
except AttributeError:
    izip= zip  # Python 3

def filecmp(filename1, filename2):
    "Do the two files have exactly the same contents?"
    with open(filename1, "rb") as fp1, open(filename2, "rb") as fp2:
        if os.fstat(fp1.fileno()).st_size != os.fstat(fp2.fileno()).st_size:
            return False # different sizes ∴ not equal

        # set up one 4k-reader for each file
        fp1_reader= functools.partial(fp1.read, 4096)
        fp2_reader= functools.partial(fp2.read, 4096)

        # pair each 4k-chunk from the two readers while they do not return b'' (EOF)
        cmp_pairs= izip(iter(fp1_reader, b''), iter(fp2_reader, b''))

        # return True for all pairs that are not equal
        inequalities= itertools.starmap(operator.ne, cmp_pairs)

        # voilà; any() stops at first True value
        return not any(inequalities)

if __name__ == "__main__":
    import sys
    print(filecmp(sys.argv[1], sys.argv[2]))

Just a different take :)

Upvotes: 16

Prashanth Babu

Reputation: 21

from __future__ import with_statement

filename1 = "G:\\test1.TXT"
filename2 = "G:\\test2.TXT"

with open(filename1) as f1:
    with open(filename2) as f2:
        file1list = f1.read().splitlines()
        file2list = f2.read().splitlines()
        list1length = len(file1list)
        list2length = len(file2list)
        if list1length == list2length:
            # same number of lines: compare them pairwise
            for index in range(len(file1list)):
                if file1list[index] == file2list[index]:
                    print(file1list[index] + "==" + file2list[index])
                else:
                    print(file1list[index] + "!=" + file2list[index] + " Not-Equal")
        else:
            print("difference in the size of the files and number of lines")

Upvotes: 1

Federico A. Ramponi

Reputation: 47075

The low level way:

from __future__ import with_statement
with open(filename1) as f1:
   with open(filename2) as f2:
      if f1.read() == f2.read():
         ...

The high level way:

import filecmp
if filecmp.cmp(filename1, filename2, shallow=False):
   ...
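
For anyone who wants to try the high-level way end to end, here is a small self-contained check. The temporary directory and the file names (a.txt, b.txt, c.txt) are made up for the demo; only filecmp.cmp itself comes from the snippet above. Passing shallow=False makes filecmp compare the file contents instead of accepting files whose os.stat() signatures (type, size, modification time) match.

```python
import filecmp
import os
import tempfile


def demo_filecmp():
    """Return filecmp results for an identical pair and a differing pair."""
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical file names, created only for this demo.
        path_a = os.path.join(tmp, "a.txt")
        path_b = os.path.join(tmp, "b.txt")
        path_c = os.path.join(tmp, "c.txt")
        contents = ((path_a, "hello\n"), (path_b, "hello\n"), (path_c, "hellO\n"))
        for path, text in contents:
            with open(path, "w") as fh:
                fh.write(text)

        same = filecmp.cmp(path_a, path_b, shallow=False)
        different = filecmp.cmp(path_a, path_c, shallow=False)
    return same, different


print(demo_filecmp())  # (True, False)
```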

Upvotes: 101

user32141

Reputation: 248

Since I can't comment on the answers of others I'll write my own.

If you use MD5, you should not just call md5.update(f.read()), since that loads the whole file into memory at once. Read it in chunks instead:

import hashlib

def get_file_md5(f, chunk_size=8192):
    # f must be a file object opened in binary mode ("rb")
    h = hashlib.md5()
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            break
        h.update(chunk)
    return h.hexdigest()

Upvotes: 7

Jeremy Cantrell

Reputation: 27426

I would use a hash of the file's contents using MD5.

import hashlib

def checksum(f):
    md5 = hashlib.md5()
    # read as bytes: hashlib needs bytes, and text mode may translate newlines
    with open(f, "rb") as fh:
        md5.update(fh.read())
    return md5.hexdigest()

def is_contents_same(f1, f2):
    return checksum(f1) == checksum(f2)

if not is_contents_same('foo.txt', 'bar.txt'):
    print('The contents are not the same!')

Upvotes: 5

ConcernedOfTunbridgeWells

Reputation: 66612

For larger files you could compute a MD5 or SHA hash of the files.
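
A minimal sketch of what that could look like, hashing in fixed-size chunks so memory use stays bounded. The function names, the SHA-256 default, and the 64 KiB chunk size are my own choices for illustration, not anything prescribed above.

```python
import hashlib
from functools import partial


def file_digest_hex(path, algorithm="sha256", chunk_size=64 * 1024):
    """Hash a file in fixed-size chunks and return the hex digest."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as fh:
        # iter(callable, sentinel) keeps calling read() until it returns b"" (EOF)
        for chunk in iter(partial(fh.read, chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def same_by_hash(path1, path2, algorithm="sha256"):
    """True if both files produce the same digest."""
    return file_digest_hex(path1, algorithm) == file_digest_hex(path2, algorithm)
```

Note that hashing always reads both files to the end, so for a one-off comparison a direct chunk-by-chunk compare can stop earlier. Hashes pay off mostly when one file is compared against many, or when the digests are stored for later.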

Upvotes: 1

Rich

Reputation: 2903

If you're going for even basic efficiency, you probably want to check the file size first:

import os

if os.path.getsize(filename1) == os.path.getsize(filename2):
  # pass the variables, not the string literals 'filename1' / 'filename2'
  if open(filename1, 'r').read() == open(filename2, 'r').read():
    pass  # Files are the same.

This saves you reading every line of two files that aren't even the same size, and thus can't be the same.

(Even further than that, you could call out to a fast MD5sum of each file and compare those, but that's not "in Python", so I'll stop here.)

Upvotes: 37

mmattax

Reputation: 27670


f = open(filename1, "r").read()
f2 = open(filename2,"r").read()
print(f == f2)


Upvotes: 2
