Gregg Lind

Reputation: 21280

Recursively compare two directories to ensure they have the same files and subdirectories

From what I observe, filecmp.dircmp is recursive but inadequate for my needs, at least in py2. I want to compare two directories and all the files they contain. Does this exist, or do I need to build it myself (using os.walk, for example)? I prefer pre-built, where someone else has already done the unit-testing :)

The actual 'comparison' can be sloppy (ignore permissions, for example), if that helps.

I would like something boolean, and report_full_closure is a printed report. It also only goes down common subdirs. AFAIC, if either directory contains anything that exists only on the left or only on the right, the two are different dirs. I built this using os.walk instead.

Upvotes: 48

Views: 87018

Answers (15)

Nelson Yeung

Reputation: 3382

Here's a tiny hack without our own recursion and algorithm:

import contextlib
import filecmp
import io
import re

def are_dirs_equal(a, b) -> bool:
    stdout = io.StringIO()
    with contextlib.redirect_stdout(stdout):
        filecmp.dircmp(a, b).report_full_closure()
    return re.search("Differing files|Only in", stdout.getvalue()) is None

Upvotes: 1

AdamE

Reputation: 656

Based on @Mateusz Kobos' currently accepted answer: it turns out that the second filecmp.cmpfiles call with shallow=False is not necessary, so we've removed it. One can get dirs_cmp.diff_files from the first dircmp.

A common misunderstanding (one that we made as well!) is that dircmp is shallow only and never compares file contents. That is not true. shallow=True is only a shortcut to save time: files whose os.stat signatures (type, size and last modification time) match are assumed equal without being read, and a differing last modification time does not by itself make two files different. If the last modification time differs between two files, their contents are read and compared. If the contents are identical, it's a match even though the last modification date differs!

We've added verbose prints here for clarity. See elsewhere (filecmp.cmp() ignoring differing os.stat() signatures?) if you want a difference in st_mtime to count as a mismatch. We also switched to the newer pathlib instead of the os library.

import filecmp
from pathlib import Path

def compare_directories_recursive(dir1: Path, dir2: Path, verbose=True):
    """
    Compares two directories recursively.
    First, file counts in each directory are compared.
    Second, files are assumed to be equal if their names, size and last modified date are equal (aka shallow=True in python terms)
    If last modified date is different, then the contents are compared by reading each file.
    Caveat: if the contents are equal and last modified is NOT equal, files are still considered equal!
    This caveat is the default python filecmp behavior as unintuitive as it may seem.

    @param dir1: First directory path
    @param dir2: Second directory path
    """

    dirs_cmp = filecmp.dircmp(str(dir1), str(dir2))
    if len(dirs_cmp.left_only)>0:
        if verbose:
            print(f"Should not be any more files in original than in destination left_only: {dirs_cmp.left_only}")
        return False
    if len(dirs_cmp.right_only)>0:
        if verbose:
            print(f"Should not be any more files in destination than in original right_only: {dirs_cmp.right_only}")
        return False
    if len(dirs_cmp.funny_files)>0:
        if verbose:
            print(f"There should not be any funny files between original and destination. These file(s) are funny {dirs_cmp.funny_files}")
        return False
    if len(dirs_cmp.diff_files)>0:
        if verbose:
            print(f"There should not be any different files between original and destination. These file(s) are different {dirs_cmp.diff_files}")
        return False

    for common_dir in dirs_cmp.common_dirs:
        new_dir1 = Path(dir1).joinpath(common_dir)
        new_dir2 = Path(dir2).joinpath(common_dir)
        # pass verbose down so nested calls honour the caller's setting
        if not compare_directories_recursive(new_dir1, new_dir2, verbose):
            return False
    return True
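The shallow behavior described above can be checked directly. This is a minimal sketch, assuming a throwaway temporary directory; the helper name compare_same_content_different_mtime is made up for illustration and is not part of filecmp:

```python
import filecmp
import os
import tempfile

def compare_same_content_different_mtime():
    """Hypothetical demo: identical bytes, different modification times."""
    with tempfile.TemporaryDirectory() as tmp:
        a = os.path.join(tmp, "a.txt")
        b = os.path.join(tmp, "b.txt")
        for path in (a, b):
            with open(path, "w") as fh:
                fh.write("same contents")
        # Move b's timestamps into the past so the os.stat signatures differ
        os.utime(b, (1000000000, 1000000000))
        filecmp.clear_cache()
        shallow_result = filecmp.cmp(a, b, shallow=True)
        deep_result = filecmp.cmp(a, b, shallow=False)
    return shallow_result, deep_result
```

Both values should come back True: the differing modification time only makes the shallow comparison fall back to reading the files.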

Upvotes: 0

Gh0sT

Reputation: 317

To anyone looking for a simple library:

https://github.com/mitar/python-deep-dircmp

DeepDirCmp basically subclasses filecmp.dircmp and shows output identical to diff -qr dir1 dir2.

Usage:

from deep_dircmp import DeepDirCmp

cmp = DeepDirCmp(dir1, dir2)
if len(cmp.get_diff_files_recursive()) == 0:
    print("Dirs match")
else:
    print("Dirs don't match")

Upvotes: 0

oats

Reputation: 457

This recursive function seems to work for me:

def has_differences(dcmp):
    differences = dcmp.left_only + dcmp.right_only + dcmp.diff_files
    if differences:
        return True
    return any([has_differences(subdcmp) for subdcmp in dcmp.subdirs.values()])

Assuming I haven't overlooked anything, you could just negate the result if you wanna know if directories are the same:

from filecmp import dircmp

comparison = dircmp("dir1", "dir2")
same = not has_differences(comparison)

Upvotes: 3

Guillaume Vincent

Reputation: 14791

Here is a simple solution with a recursive function:

import filecmp

def same_folders(dcmp):
    if dcmp.diff_files or dcmp.left_only or dcmp.right_only:
        return False
    for sub_dcmp in dcmp.subdirs.values():
        if not same_folders(sub_dcmp):
            return False
    return True

same_folders(filecmp.dircmp('/tmp/archive1', '/tmp/archive2'))

Upvotes: 9

Brent

Reputation: 4283

Since a True or False result is all you want, if you have diff installed:

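import subprocess

def are_dir_trees_equal(dir1, dir2):
    # diff exits with 0 when the trees match and non-zero otherwise.
    # The output is discarded (DEVNULL needs Python 3.3+) so that a long
    # report cannot fill a pipe and block the child process.
    exit_code = subprocess.call(["diff", "-r", dir1, dir2], stdout=subprocess.DEVNULL)
    return exit_code == 0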

Upvotes: 2

Rok

Reputation: 416

This will check if files are in the same locations and if their content is the same. It will not correctly validate for empty subfolders.

import filecmp
import glob
import os

path_1 = '.'
path_2 = '.'

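def folders_equal(f1, f2):
    # Sort both listings so files are paired by relative path, not by glob order
    files_1 = sorted(x for x in glob.iglob(os.path.join(f1, '**'), recursive=True) if os.path.isfile(x))
    files_2 = sorted(x for x in glob.iglob(os.path.join(f2, '**'), recursive=True) if os.path.isfile(x))
    # zip() silently truncates, so a differing file count has to be caught first
    if len(files_1) != len(files_2):
        return False
    file_pairs = list(zip(files_1, files_2))

    # every pair must match (all), not just one of them (any)
    locations_equal = all(os.path.relpath(x, f1) == os.path.relpath(y, f2) for x, y in file_pairs)
    files_equal = all(filecmp.cmp(*x) for x in file_pairs)

    return locations_equal and files_equal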

folders_equal(path_1, path_2)

Upvotes: 0

alzix

Reputation: 23

Based on Python issue 12932 and the filecmp documentation, you may use the following example:

import os
import filecmp

# force content compare instead of os.stat attributes only comparison
filecmp.cmpfiles.__defaults__ = (False,)

def _is_same_helper(dircmp):
    assert not dircmp.funny_files
    if dircmp.left_only or dircmp.right_only or dircmp.diff_files or dircmp.funny_files:
        return False
    for sub_dircmp in dircmp.subdirs.values():
        if not _is_same_helper(sub_dircmp):
            return False
    return True

def is_same(dir1, dir2):
    """
    Recursively compare two directories
    :param dir1: path to first directory 
    :param dir2: path to second directory
    :return: True in case directories are the same, False otherwise
    """
    if not os.path.isdir(dir1) or not os.path.isdir(dir2):
        return False
    dircmp = filecmp.dircmp(dir1, dir2)
    return _is_same_helper(dircmp)

Upvotes: 1

Philippe Ombredanne

Reputation: 2025

filecmp.dircmp is the way to go. But it does not compare the content of files found with the same path in the two compared directories. Instead, filecmp.dircmp only looks at file attributes. Since dircmp is a class, you can fix that with a dircmp subclass that overrides the phase3 method (the one that compares files), so that content is compared instead of only os.stat attributes.

import filecmp

class dircmp(filecmp.dircmp):
    """
    Compare the content of dir1 and dir2. In contrast with filecmp.dircmp, this
    subclass compares the content of files with the same path.
    """
    def phase3(self):
        """
        Find out differences between common files.
        Ensure we are using content comparison with shallow=False.
        """
        fcomp = filecmp.cmpfiles(self.left, self.right, self.common_files,
                                 shallow=False)
        self.same_files, self.diff_files, self.funny_files = fcomp

Then you can use this to return a boolean:

import os.path

def is_same(dir1, dir2):
    """
    Compare two directory trees content.
    Return False if they differ, True if they are the same.
    """
    compared = dircmp(dir1, dir2)
    if (compared.left_only or compared.right_only or compared.diff_files 
        or compared.funny_files):
        return False
    for subdir in compared.common_dirs:
        if not is_same(os.path.join(dir1, subdir), os.path.join(dir2, subdir)):
            return False
    return True

In case you want to reuse this code snippet, it is hereby dedicated to the Public Domain or the Creative Commons CC0 at your choice (in addition to the default license CC-BY-SA provided by SO).
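To see why the stat-only shortcut matters, here is a minimal sketch in which two files differ in content but share size and modification time. The helper name compare_different_content_same_signature is made up for illustration, and the timestamps are forced to match with os.utime:

```python
import filecmp
import os
import tempfile

def compare_different_content_same_signature():
    """Hypothetical demo: same size and mtime, different bytes."""
    with tempfile.TemporaryDirectory() as tmp:
        a = os.path.join(tmp, "a.txt")
        b = os.path.join(tmp, "b.txt")
        with open(a, "w") as fh:
            fh.write("aaaa")
        with open(b, "w") as fh:
            fh.write("bbbb")
        # Give both files identical timestamps so (type, size, mtime) match
        for path in (a, b):
            os.utime(path, (1000000000, 1000000000))
        filecmp.clear_cache()
        shallow_result = filecmp.cmp(a, b)  # shallow=True is the default
        deep_result = filecmp.cmp(a, b, shallow=False)
    return shallow_result, deep_result
```

Under these assumptions the default comparison reports the files as equal, while shallow=False reports them as different, which is the case the phase3 override guards against.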

Upvotes: 27

Raullen Chai

Reputation: 315

Another solution: compare the layout of dir1 and dir2, ignoring the content of the files.

See gist here: https://gist.github.com/4164344

Edit: here's the code, in case the gist gets lost for some reason:

import os

def compare_dir_layout(dir1, dir2):
    def _compare_dir_layout(dir1, dir2):
        for (dirpath, dirnames, filenames) in os.walk(dir1):
            # path of the current directory relative to the tree root
            relative_path = os.path.relpath(dirpath, dir1)
            for filename in filenames:
                # os.path.join keeps this portable instead of hardcoding '\\'
                if not os.path.exists(os.path.join(dir2, relative_path, filename)):
                    print(relative_path, filename)

    print('files in "' + dir1 + '" but not in "' + dir2 + '"')
    _compare_dir_layout(dir1, dir2)
    print('files in "' + dir2 + '" but not in "' + dir1 + '"')
    _compare_dir_layout(dir2, dir1)


compare_dir_layout('xxx', 'yyy')

Upvotes: 3

NotAUser

Reputation: 1456

import filecmp
import os

def same(dir1, dir2):
    """Returns True if recursively identical, False otherwise

    """
    c = filecmp.dircmp(dir1, dir2)
    if c.left_only or c.right_only or c.diff_files or c.funny_files:
        return False
    else:
        same_so_far = True
        for i in c.common_dirs:
            same_so_far = same_so_far and same(os.path.join(dir1, i), os.path.join(dir2, i))
            if not same_so_far:
                break
        return same_so_far

Upvotes: 0

Mateusz Kobos

Reputation: 641

Here's an alternative implementation of the comparison function with the filecmp module. It uses recursion instead of os.walk, so it is a little simpler. However, it does not recurse simply by using the common_dirs and subdirs attributes, since in that case we would implicitly be using the default "shallow" file comparison, which is probably not what you want. In the implementation below, files with the same name are always compared by their contents only.

import filecmp
import os.path

def are_dir_trees_equal(dir1, dir2):
    """
    Compare two directories recursively. Files in each directory are
    assumed to be equal if their names and contents are equal.

    @param dir1: First directory path
    @param dir2: Second directory path

    @return: True if the directory trees are the same and 
        there were no errors while accessing the directories or files, 
        False otherwise.
   """

    dirs_cmp = filecmp.dircmp(dir1, dir2)
    if len(dirs_cmp.left_only)>0 or len(dirs_cmp.right_only)>0 or \
        len(dirs_cmp.funny_files)>0:
        return False
    (_, mismatch, errors) =  filecmp.cmpfiles(
        dir1, dir2, dirs_cmp.common_files, shallow=False)
    if len(mismatch)>0 or len(errors)>0:
        return False
    for common_dir in dirs_cmp.common_dirs:
        new_dir1 = os.path.join(dir1, common_dir)
        new_dir2 = os.path.join(dir2, common_dir)
        if not are_dir_trees_equal(new_dir1, new_dir2):
            return False
    return True

Upvotes: 38

Gregg Lind

Reputation: 21280

Here is my solution: gist

import filecmp
import os

def dirs_same_enough(dir1,dir2,report=False):
    ''' use os.walk and filecmp.cmpfiles to
    determine if two dirs are 'same enough'.

    Args:
        dir1, dir2:  two directory paths
        report:  if True, print the filecmp.dircmp(dir1,dir2).report_full_closure()
                 before returning

    Returns:
        bool

    '''
    # os walk:  root, list(dirs), list(files)
    # those lists won't have consistent ordering,
    # os.walk also has no guaranteed ordering, so have to sort.
    walk1 = sorted(list(os.walk(dir1)))
    walk2 = sorted(list(os.walk(dir2)))

    def report_and_exit(report,bool_):
        if report:
            filecmp.dircmp(dir1,dir2).report_full_closure()
        return bool_

    if len(walk1) != len(walk2):
        return report_and_exit(report,False)

    for (p1,d1,fl1),(p2,d2,fl2) in zip(walk1,walk2):
        d1,fl1, d2, fl2 = set(d1),set(fl1),set(d2),set(fl2)
        if d1 != d2 or fl1 != fl2:
            return report_and_exit(report,False)
        # a single cmpfiles call already compares every file in this directory
        same,diff,weird = filecmp.cmpfiles(p1,p2,fl1,shallow=False)
        if diff or weird:
            return report_and_exit(report,False)

    return report_and_exit(report,True)

Upvotes: 0

Katriel

Reputation: 123662

dircmp can be recursive: see report_full_closure.

As far as I know dircmp does not offer a directory comparison function. It would be very easy to write your own, though; use left_only and right_only on dircmp to check that the files in the directories are the same and then recurse on the subdirs attribute.

Upvotes: 3

asthasr

Reputation: 9407

The report_full_closure() method is recursive:

comparison = filecmp.dircmp('/directory1', '/directory2')
comparison.report_full_closure()

Edit: After the OP's edit, I would say that it's best to just use the other functions in filecmp. I think os.walk is unnecessary; better to simply recurse through the lists produced by common_dirs, etc., although in some cases (large directory trees) this might risk a Max Recursion Depth error if implemented poorly.
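If recursion depth is a concern, the same traversal can be written with an explicit stack. This is only a sketch: the name trees_match_iterative is made up for illustration, and it relies on dircmp's default (shallow) file comparison:

```python
import filecmp

def trees_match_iterative(dir1, dir2):
    """Hypothetical helper: walk dircmp objects with a stack instead of recursion."""
    stack = [filecmp.dircmp(dir1, dir2)]
    while stack:
        dcmp = stack.pop()
        # Any one-sided entry, differing file or uncomparable file ends the search
        if dcmp.left_only or dcmp.right_only or dcmp.diff_files or dcmp.funny_files:
            return False
        # subdirs maps each common subdirectory name to its own dircmp object
        stack.extend(dcmp.subdirs.values())
    return True
```

Because the pending comparisons live in a list, the depth of the directory tree no longer affects the Python call stack.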

Upvotes: 6
