Reputation: 4557
I have two versions of a very large and complicated directory structure with tens of thousands of individual files and I want to look for significant file changes from one version to another.
Each and every file has changed in some minor way. For example, you might have a file called intro.txt whose first line reads
[Build 1057 done by Mike 12:00] - (version 1)
[Build 1065 done by Mike 18:10] - (version 2)
I don't care about changes like that since they contain no useful information. I also don't care about corrections to spelling mistakes or the addition of a word or two.
What I really want to do is pull out the files that have changed in a more major way. One way they might have changed is for a lot of extra content to have been added, which would increase the file size. That's the kind of change I am interested in.
So, how would you recursively walk the directories looking for files whose size has increased (or decreased) by a set amount from one version to the next?
I'm running Linux, but pretty much any language will do.
Upvotes: 3
Views: 3982
Reputation: 22560
There are a few modules on CPAN that provide this. For example,
File::DirCompare looks the most promising:
use File::DirCompare;

File::DirCompare->compare('dirA', 'dirB', sub {
    my ($a, $b) = @_;
    # The callback runs for each file that differs or is missing on one side,
    # so perform any extra checks on files $a and $b here.
});
So one example, showing files that differ in size by more than a prescribed number of bytes, would be:

File::DirCompare->compare('dirA', 'dirB', size_diff_by_more_than(1024) );

sub size_diff_by_more_than {
    my $this = shift;    # threshold in bytes
    return sub {
        # A side on which the file is missing is passed as undef; drop it.
        my @files = grep { $_ } @_;
        if ( @files == 2 ) {
            # get the two file sizes and report if more than $this
            my @sizes = sort { $a <=> $b } map { (stat $_)[7] } @files;
            print "Different by more than $this bytes: $files[1]\n"
                if $sizes[1] - $sizes[0] > $this;
        }
        else {
            # the file exists in only one of the two directories
            print "Only: $files[0]\n";
        }
    };
}
Upvotes: 4
Reputation: 46770
In C, you call stat on the files.
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>

int main( int argc, char* argv[] )
{
    struct stat fileInfoA;
    struct stat fileInfoB;

    if( argc == 3 )
    {
        stat( argv[1], &fileInfoA );
        stat( argv[2], &fileInfoB );

        // Now, you can use the following fields of stat to compare the files:
        // struct stat {
        //     dev_t     st_dev;     /* ID of device containing file */
        //     ino_t     st_ino;     /* inode number */
        //     mode_t    st_mode;    /* protection */
        //     nlink_t   st_nlink;   /* number of hard links */
        //     uid_t     st_uid;     /* user ID of owner */
        //     gid_t     st_gid;     /* group ID of owner */
        //     dev_t     st_rdev;    /* device ID (if special file) */
        //     off_t     st_size;    /* total size, in bytes */
        //     blksize_t st_blksize; /* blocksize for filesystem I/O */
        //     blkcnt_t  st_blocks;  /* number of blocks allocated */
        //     time_t    st_atime;   /* time of last access */
        //     time_t    st_mtime;   /* time of last modification */
        //     time_t    st_ctime;   /* time of last status change */
        // };
    }

    return 0;
}
Now, that's useful for comparing individual files. To compare the files in a directory tree recursively you will need to use recursion (or an explicit stack). You will also need the opendir() and readdir() library functions to list each directory's entries.
Upvotes: 2
Reputation: 4029
On the point of determining the amount of difference between two files:
It might be good to run a diff of the two files and put the length of the diff output in relation to the overall size of the file.
This (in addition to a file size comparison) would catch cases where there were a lot of changes in the file but the overall file size did not change significantly. This may or may not be appropriate for your use case.
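As a rough illustration of that ratio, here is a minimal Python sketch. It uses the standard difflib module rather than the external diff command, and both the function name change_ratio and the choice to measure in lines (changed lines divided by the line count of the longer file) are assumptions of mine, not anything prescribed above.

```python
import difflib
from pathlib import Path


def change_ratio(old_path, new_path):
    """Return changed lines divided by the line count of the longer file.

    Hypothetical helper: 0.0 means identical, 1.0 means every line changed.
    """
    old_lines = Path(old_path).read_text(errors="replace").splitlines()
    new_lines = Path(new_path).read_text(errors="replace").splitlines()

    matcher = difflib.SequenceMatcher(None, old_lines, new_lines, autojunk=False)
    changed = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Count a replaced/inserted/deleted block by its longer side.
            changed += max(i2 - i1, j2 - j1)

    total = max(len(old_lines), len(new_lines), 1)  # avoid dividing by zero
    return changed / total
```

A file where only the build stamp line changed gets a small ratio, while a file that was rewritten to the same size gets a ratio near 1.0, which a pure size comparison would miss. SequenceMatcher can be slow on very large files, so it may be worth running this only on files that a cheaper check has already flagged.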
Upvotes: 0
Reputation: 15806
You can generate a diff of the two directories and run the diffstat utility on it. Diffstat reports statistics on the changed files: how many lines were added, removed or modified. I guess this will give you more information than just comparing file sizes.
Upvotes: 2
Reputation: 1686
In bash:
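before_dir=foo.old
after_dir=foo.new
interesting_size=10
# -print0 with read -d '' keeps file names containing spaces intact
find "$before_dir" -type f -print0 | while IFS= read -r -d '' file; do
    # map the old path onto the matching path in the new tree
    diff_size=$(diff -u "$file" "$after_dir${file#"$before_dir"}" | wc -l)
    if [ "$diff_size" -ge "$interesting_size" ]; then
        echo "$file"
    fi
done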
Upvotes: 2
Reputation: 399793
I'd do a diff -r -q FOLDER1 FOLDER2 to get a list of files that have changed, then process that list (a bash script is sufficient), check the size difference for each file, and print the filename if the difference exceeds a threshold.
The -q option to diff is for brief output: it prints one line for each file that differs and doesn't print the per-line changes. (Note that -b is a different option, which ignores changes in the amount of whitespace.)
The -r option is for recursive comparison of the two directories.
Upvotes: 2
Reputation: 53310
In Python you want to start with the filecmp module.
Compare the directories, then print out the files which are missing from one side or the other (left_only and right_only).
Then for the diff_files you need to do a more detailed comparison: use os.stat to find out the sizes, and print out the filename if the difference is too large.
Finally, you need to recurse into the common subdirectories.
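Those steps could be sketched roughly as follows. This is only a sketch: the function name report_changes, the tuple format it yields and the byte threshold are made up for illustration, and it relies on filecmp.dircmp's default settings (shallow comparison and its standard ignore list).

```python
import filecmp
import os


def report_changes(left, right, threshold):
    """Yield (description, path) for missing files and for files whose
    size differs by more than `threshold` bytes (hypothetical helper)."""
    cmp = filecmp.dircmp(left, right)

    # Entries present on one side only (these may be files or directories).
    for name in cmp.left_only:
        yield ("left only", os.path.join(left, name))
    for name in cmp.right_only:
        yield ("right only", os.path.join(right, name))

    # Files present on both sides whose contents differ.
    for name in cmp.diff_files:
        old_file = os.path.join(left, name)
        new_file = os.path.join(right, name)
        delta = os.stat(new_file).st_size - os.stat(old_file).st_size
        if abs(delta) > threshold:
            yield ("size changed by %+d bytes" % delta, new_file)

    # Recurse into subdirectories that exist on both sides.
    for name in cmp.common_dirs:
        yield from report_changes(
            os.path.join(left, name), os.path.join(right, name), threshold
        )
```

You would then loop over something like report_changes("foo.old", "foo.new", 1024) and print each tuple. Files that differ only by a build stamp still show up in diff_files, but the threshold filters them out.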
Upvotes: 3