WalkingRandomly

Reputation: 4557

Programmatically compare file sizes in Linux

I have two versions of a very large and complicated directory structure with tens of thousands of individual files and I want to look for significant file changes from one version to another.

Each and every file has changed in some minor way. For example, you might have a file called intro.txt which would contain

[Build 1057 done by Mike 12:00] - (version 1)

[Build 1065 done by Mike 18:10] - (version 2)

I don't care about changes like that since they contain no useful information. I also don't care about corrections to spelling mistakes or the addition of a word or two.

What I really want to do is pull out the files that have changed in a more major way. One such change is a lot of extra content being added, which would increase the file size. That's the kind of change I am interested in.

So, how would you recursively parse through the directories looking for files that have increased (or decreased) in size by a set amount from one version to the next?

I'm running Linux, but pretty much any language will do.

Upvotes: 3

Views: 3982

Answers (7)

draegtun

Reputation: 22560

There are a few modules on CPAN that provide this. For example, File::DirCompare looks the most promising:

 use File::DirCompare;

 File::DirCompare->compare('dirA', 'dirB', sub {
     my ($a, $b) = @_;

     # The callback runs on different or missing files,
     # so perform extra checks on files $a and $b here.

 });

One example, showing files whose sizes differ by more than a prescribed number of bytes, would be:

File::DirCompare->compare('dirA', 'dirB', size_diff_by_more_than(1024) );

sub size_diff_by_more_than {
    my $this = shift;

    return sub {
        my @files = grep { $_ } @_;

        if ( @files == 2 ) {
            # get the two file sizes and report if more than $this
            my @sizes = sort { $a <=> $b } map { (stat)[7] } @files;
            print "Different by more than $this bytes: $files[1]\n"
                if $sizes[1] - $sizes[0] > $this
        }
        else {
            print "Only: $files[0]\n";
        }
    };
}

Upvotes: 4

dicroce

Reputation: 46770

In C, you call stat on the files.

#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>

int main( int argc, char* argv[] )
{
   struct stat fileInfoA;
   struct stat fileInfoB;

   if( argc == 3 )
   {
     stat( argv[1], &fileInfoA );
     stat( argv[2], &fileInfoB );

     // Now, you can use the following fields of stat to compare the files:
     //      struct stat {
     //          dev_t     st_dev;     /* ID of device containing file */
     //          ino_t     st_ino;     /* inode number */
     //          mode_t    st_mode;    /* protection */
     //          nlink_t   st_nlink;   /* number of hard links */
     //          uid_t     st_uid;     /* user ID of owner */
     //          gid_t     st_gid;     /* group ID of owner */
     //          dev_t     st_rdev;    /* device ID (if special file) */
     //          off_t     st_size;    /* total size, in bytes */
     //          blksize_t st_blksize; /* blocksize for filesystem I/O */
     //          blkcnt_t  st_blocks;  /* number of blocks allocated */
     //          time_t    st_atime;   /* time of last access */
     //          time_t    st_mtime;   /* time of last modification */
     //          time_t    st_ctime;   /* time of last status change */
     //      };

   }

   return 0;
}

Now, that's useful for comparing individual files. To compare the files in a directory recursively, you will need to use recursion (or an explicit stack). You will also need the opendir() and readdir() library functions to list each directory's entries.

Upvotes: 2

user55400

Reputation: 4029

On the point of determining the amount of difference between two files:

It might be good to run a diff of the two files and express the length of the diff output as a fraction of the overall size of the file.

This (in addition to a file size comparison) would catch cases where there were a lot of changes in the file but the overall file size did not change significantly. This may or may not be appropriate for your use case.
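
A minimal Python sketch of this idea, assuming the files are plain text. It counts the lines a diff would mark as added or removed and divides by the line count of the larger file. The name diff_ratio is made up for illustration, and what counts as a "large" ratio is left to you.

```python
import difflib
from pathlib import Path


def diff_ratio(old_path, new_path):
    """Return (added + removed lines) / (line count of the larger file)."""
    old_lines = Path(old_path).read_text(errors="replace").splitlines()
    new_lines = Path(new_path).read_text(errors="replace").splitlines()
    matcher = difflib.SequenceMatcher(None, old_lines, new_lines, autojunk=False)
    changed = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Lines dropped from the old file plus lines new in the new file.
            changed += (i2 - i1) + (j2 - j1)
    # The trailing 1 avoids dividing by zero when both files are empty.
    return changed / max(len(old_lines), len(new_lines), 1)
```

A changed build stamp in a long file gives a ratio close to 0, while a file that was largely rewritten gives a much higher one, even if its size barely moved.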

Upvotes: 0

Eugene Morozov

Reputation: 15806

You can generate a diff of the two directories and run the diffstat utility on it. Diffstat reports statistics on the changed files: how many lines were added, removed or modified. I guess this will give you more information than just comparing file sizes.
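
If diffstat isn't to hand, roughly the same numbers can be sketched in Python with difflib. This is only an approximation under some assumptions, not a reimplementation of diffstat: it treats every file as text, only looks at files present in both trees, and reports added and removed line counts. The names line_stats and tree_stats are made up here.

```python
import difflib
import os


def read_lines(path):
    """Read a file as text, replacing undecodable bytes."""
    with open(path, errors="replace") as handle:
        return handle.read().splitlines()


def line_stats(old_path, new_path):
    """Return (added, removed) line counts between two text files."""
    matcher = difflib.SequenceMatcher(
        None, read_lines(old_path), read_lines(new_path), autojunk=False
    )
    added = removed = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            removed += i2 - i1
            added += j2 - j1
    return added, removed


def tree_stats(old_root, new_root):
    """Yield (relative_path, added, removed) for changed files in both trees."""
    for dirpath, _dirnames, filenames in os.walk(old_root):
        for name in filenames:
            old_path = os.path.join(dirpath, name)
            rel = os.path.relpath(old_path, old_root)
            new_path = os.path.join(new_root, rel)
            if os.path.isfile(new_path):
                added, removed = line_stats(old_path, new_path)
                if added or removed:
                    yield rel, added, removed
```

You could then loop over tree_stats('dirA', 'dirB') and print only the entries where added plus removed exceeds whatever threshold you consider significant.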

Upvotes: 2

Daniel Watkins

Reputation: 1686

In bash:

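before_dir=foo.old
after_dir=foo.new
interesting_size=10
# Read NUL-separated names so paths containing spaces survive intact.
find "$before_dir" -type f -print0 | while IFS= read -r -d '' file; do
    # Strip the old root from the path to find the matching file in the new tree.
    diff_size=$(diff -u "$file" "$after_dir${file#"$before_dir"}" | wc -l)
    if [ "$diff_size" -ge "$interesting_size" ]; then
        echo "$file"
    fi
done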

Upvotes: 2

unwind

Reputation: 399793

I'd do a diff -r -q FOLDER1 FOLDER2 to get a list of files that have changed, then process that list (a bash script is sufficient), check the size difference for each file, and print the filename if the difference exceeds a threshold.

The -q option to diff is for brief output: it prints one line for each file that differs and doesn't print per-line changes. (Note that -b is a different option, which ignores changes in the amount of whitespace.)

The -r is for recursive comparison of two directories, as usual.

Upvotes: 2

Douglas Leeder

Reputation: 53310

In Python, you want to start with the filecmp module.

Compare the directories, then print out the files which are missing from one side or the other (left_only and right_only).

Then, for the diff_files, you need to do a more detailed comparison: use os.stat to find out the sizes, and print out the filename if the difference is too large.

Finally, you need to recurse into the common subdirectories.
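
Those steps could be sketched roughly as follows. The function name big_changes and its return shape are made up for illustration, and it collects results into lists rather than printing so they are easier to post-process.

```python
import filecmp
import os


def big_changes(old_dir, new_dir, threshold):
    """Return (changed, old_only, new_only) for two directory trees.

    changed holds (path_in_new_tree, size_delta_in_bytes) for files whose
    size moved by more than threshold bytes in either direction.
    """
    changed, old_only, new_only = [], [], []

    def visit(cmp):
        # Entries that exist on only one side.
        old_only.extend(os.path.join(cmp.left, n) for n in cmp.left_only)
        new_only.extend(os.path.join(cmp.right, n) for n in cmp.right_only)
        # Files present on both sides that filecmp considers different.
        for name in cmp.diff_files:
            old_size = os.stat(os.path.join(cmp.left, name)).st_size
            new_size = os.stat(os.path.join(cmp.right, name)).st_size
            if abs(new_size - old_size) > threshold:
                changed.append((os.path.join(cmp.right, name), new_size - old_size))
        # Recurse into the subdirectories common to both sides.
        for sub in cmp.subdirs.values():
            visit(sub)

    visit(filecmp.dircmp(old_dir, new_dir))
    return changed, old_only, new_only
```

For example, big_changes('foo.old', 'foo.new', 1024) would list the files that grew or shrank by more than 1 KiB. Bear in mind that dircmp skips a few names by default (see filecmp.DEFAULT_IGNORES).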

Upvotes: 3
