Reputation: 17018
I have a shell script in which I need to check whether two files contain the same data. I do this for a lot of files, and in my script the diff command seems to be the performance bottleneck.
Here's the line:
diff -q $dst $new > /dev/null
if ($status) then ...
Could there be a faster way to compare the files, maybe a custom algorithm instead of the default diff?
Upvotes: 383
Views: 356455
Reputation: 1277
You could compare hashes, e.g. SHA-256 or MD5.
same-contents() {
echo "$(sha256sum "$1" | sed 's/ .*//') $2" | sha256sum --check 1>/dev/null 2>&1
}
alias same-file="same-contents" # Technically the same "file" has the same inode
if same-contents file1.txt file2.txt; then
echo true
else
echo false
fi
That is essentially what was suggested in this answer, but they overcomplicate things by storing the hash in a file first. The sed call trims off the filename to leave only the hash.
You should likely still prefer cmp because it is purpose-built for this and can stop at the first differing byte, whereas hashing requires reading both files in their entirety, but I thought this was an interesting script.
Upvotes: 1
Reputation: 18438
If you are looking for a more customizable diff for this, then git diff can be used.
if git diff --no-index --quiet old.txt new.txt; then
echo "files contents are identical"
else
echo "files differ"
fi
--quiet
Disable all output of the program. Implies --exit-code.
--exit-code
Make the program exit with codes similar to diff(1). That is, it exits with 1 if there were differences and 0 means no differences.
Also, there are various algorithms and settings to choose from: [ref]
--diff-algorithm={patience|minimal|histogram|myers}
Choose a diff algorithm. The variants are as follows:
default, myers The basic greedy diff algorithm. Currently, this is the default.
minimal Spend extra time to make sure the smallest possible diff is produced.
patience Use "patience diff" algorithm when generating patches.
histogram This algorithm extends the patience algorithm to "support low-occurrence common elements".
Upvotes: 0
Reputation: 151
Doing some testing with a Raspberry Pi 3B+ (I'm using an overlay file system, and need to sync periodically), I ran a comparison of my own for diff -q and cmp -s; note that this is a log from inside /dev/shm, so disk access speeds are a non-issue:
[root@mypi shm]# dd if=/dev/urandom of=test.file bs=1M count=100 ; time diff -q test.file test.copy && echo diff true || echo diff false ; time cmp -s test.file test.copy && echo cmp true || echo cmp false ; cp -a test.file test.copy ; time diff -q test.file test.copy && echo diff true || echo diff false; time cmp -s test.file test.copy && echo cmp true || echo cmp false
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 6.2564 s, 16.8 MB/s
Files test.file and test.copy differ
real 0m0.008s
user 0m0.008s
sys 0m0.000s
diff false
real 0m0.009s
user 0m0.007s
sys 0m0.001s
cmp false
cp: overwrite ‘test.copy’? y
real 0m0.966s
user 0m0.447s
sys 0m0.518s
diff true
real 0m0.785s
user 0m0.211s
sys 0m0.573s
cmp true
[root@mypi shm]# pico /root/rwbscripts/utils/squish.sh
I ran it a couple of times. cmp -s consistently had slightly shorter times on the test box I was using. So if you want to use cmp -s to act on the result of comparing two files:
identical (){
echo "$1" and "$2" are the same.
echo This is a function, you can put whatever you want in here.
}
different () {
echo "$1" and "$2" are different.
echo This is a function, you can put whatever you want in here, too.
}
cmp -s "$FILEA" "$FILEB" && identical "$FILEA" "$FILEB" || different "$FILEA" "$FILEB"
Upvotes: 2
Reputation: 163
Try also to use the cksum command:
# file1 and file2 are placeholders for the paths to compare
chk1=$(cksum "$file1" | awk '{print $1}')
chk2=$(cksum "$file2" | awk '{print $1}')
if [ "$chk1" -eq "$chk2" ]
then
echo "File is identical"
else
echo "File is not identical"
fi
The cksum command prints a CRC checksum followed by the byte count of the file; the script above compares only the checksum. See 'man cksum'.
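Because cksum prints both the CRC and the byte count, one possible variation is to compare the whole output line instead of only the first field. The following is a minimal sketch, assuming a POSIX shell; the function name same_cksum is made up for illustration. Reading each file from standard input keeps the file name out of the output, so the two strings can be compared directly.

```shell
# Hypothetical helper: succeeds when both files have the same CRC and byte count.
# Matching checksums make identical contents very likely, but do not prove it.
same_cksum() {
    # Unreadable or missing input: report trouble with status 2.
    [ -r "$1" ] && [ -r "$2" ] || return 2
    # With input on stdin, cksum prints only "<crc> <bytes>".
    [ "$(cksum < "$1")" = "$(cksum < "$2")" ]
}
```

It can be used like the other snippets here, for example: if same_cksum file1 file2; then echo "File is identical"; fi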
Upvotes: 3
Reputation: 16399
For files that are not different, any method will require having read both files entirely, even if the read was in the past.
There is no alternative. So creating hashes or checksums at some point in time requires reading the whole file. Big files take time.
File metadata retrieval is much faster than reading a large file.
So, is there any file metadata you can use to establish that the files are different? File size, or even the results of the file command, which reads only a small portion of the file?
File size example code fragment:
ls -l "$1" "$2" |
awk 'NR==1{a=$5} NR==2{b=$5}
END{val=(a==b)?0 :1; exit( val) }'
[ $? -eq 0 ] && echo 'same size' || echo 'different'
If the files are the same size then you are stuck with full file reads.
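The two stages described above (cheap size check first, full read only when the sizes match) could be combined roughly as follows. This is a sketch under assumptions, not a definitive implementation: the function name same_contents is hypothetical, and wc -c is used only because it is portable. Whether wc -c avoids reading the file depends on the implementation; GNU stat -c %s or BSD stat -f %z query the size from metadata directly.

```shell
# Hypothetical helper: compare sizes first, fall back to cmp only when they match.
# Exit status: 0 = same contents, 1 = different, 2 = a file could not be read.
same_contents() {
    size1=$(wc -c < "$1") || return 2
    size2=$(wc -c < "$2") || return 2
    # Different sizes: the contents cannot be identical, so skip the full read.
    [ "$size1" -eq "$size2" ] || return 1
    # Same size: a byte-by-byte comparison is still required.
    cmp -s -- "$1" "$2"
}
```

Usage would look like: same_contents "$1" "$2" && echo 'same' || echo 'different'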
Upvotes: 3
Reputation: 466
You can compare checksums using an algorithm such as SHA-256:
# store only the hash, without the file name
sha256sum oldFile | cut -d ' ' -f 1 > oldFile.sha256
echo "$(cat oldFile.sha256)  newFile" | sha256sum --check
newFile: OK
If the files differ, the result will be:
newFile: FAILED
sha256sum: WARNING: 1 computed checksum did NOT match
Upvotes: 9
Reputation: 599
Because I suck and don't have enough reputation points I can't add this tidbit in as a comment.
But, if you are going to use the cmp command (and don't need/want to be verbose) you can just grab the exit status. Per the cmp man page:
If a FILE is '-' or missing, read standard input. Exit status is 0 if inputs are the same, 1 if different, 2 if trouble.
So, you could do something like:
STATUS="$(cmp --silent "$FILE1" "$FILE2"; echo $?)" # capture the exit status of cmp
if [[ $STATUS -ne 0 ]]; then # non-zero status: the files differ (1) or cmp had trouble (2)
DO A COMMAND ON $FILE1
else
DO SOMETHING ELSE
fi
EDIT: Thanks for the comments everyone! I updated the test syntax here. However, I would suggest you use Vasili's answer if you are looking for something similar to this answer in readability, style, and syntax.
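Since the man page excerpt above distinguishes three exit statuses, it may be worth handling "trouble" separately from "different". A minimal sketch, assuming a POSIX shell; the function name compare_files and the printed words are invented for illustration, and -s is the short form of --silent.

```shell
# Hypothetical wrapper: print one word describing the cmp exit status.
compare_files() {
    status=0
    cmp -s -- "$1" "$2" || status=$?
    case $status in
        0) echo "identical" ;;
        1) echo "different" ;;
        *) echo "error" ;;   # 2 means trouble, e.g. a missing or unreadable file
    esac
}
```

A caller could then branch on the printed word, or the case arms could run the actual commands instead of echo.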
Upvotes: 23
Reputation: 4420
Like @Alex Howansky, I use 'cmp --silent' for this, but I need both a positive and a negative response, so I use:
cmp --silent file1 file2 && echo '### SUCCESS: Files Are Identical! ###' || echo '### WARNING: Files Are Different! ###'
I can then run this in the terminal or over ssh to check files against a constant file.
Upvotes: 90
Reputation: 53646
I believe cmp will stop at the first byte difference:
cmp --silent "$old" "$new" || echo "files are different"
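Since the question is about checking a lot of files, the same cmp call can be wrapped in a loop. The following is only a sketch under assumptions: it supposes the old and new copies live in two directories under the same file names, and the function name list_changed is hypothetical.

```shell
# Hypothetical batch helper: print each regular file directly inside $1
# whose counterpart in $2 is missing or has different contents.
list_changed() {
    for old in "$1"/*; do
        [ -f "$old" ] || continue          # skip directories and an unmatched glob
        new="$2/${old##*/}"                # same file name inside the second directory
        cmp -s -- "$old" "$new" || printf '%s\n' "$old"
    done
}
```

For example, list_changed olddir newdir prints one path per file that needs attention; a missing counterpart is reported as well, because cmp exits with a non-zero status in that case.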
Upvotes: 600
Reputation: 10296
To quickly and safely compare any two files:
if cmp --silent -- "$FILE1" "$FILE2"; then
echo "files contents are identical"
else
echo "files differ"
fi
It's readable, efficient, and works for any file names including "` $()
Upvotes: 56