Reputation: 17018

Fastest way to tell if two files have the same contents in Unix/Linux?

I have a shell script in which I need to check whether two files contain the same data or not. I do this for a lot of files, and in my script the diff command seems to be the performance bottleneck.

Here's the line:

diff -q $dst $new > /dev/null

if ($status) then ...

Could there be a faster way to compare the files, maybe a custom algorithm instead of the default diff?

Upvotes: 383

Answers (10)

Sinjai

Reputation: 1277

You could compare hashes, e.g. SHA-256 or MD5.

same-contents() {
    echo "$(sha256sum "$1" | sed 's/ .*//') $2" | sha256sum --check 1>/dev/null 2>&1
}
alias same-file="same-contents" # Technically the same "file" has the same inode

if same-contents file1.txt file2.txt; then 
    echo true
else
    echo false
fi

That is essentially what was suggested in this answer, but they overcomplicate things by storing the hash in a file first. The sed call trims off the filename to leave only the hash.

You should likely still prefer cmp because it is purpose-built for this and can stop at the first differing byte, whereas hashing requires reading both files in full, but I thought this was an interesting script.
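
A slightly shorter variant, as a sketch only: when sha256sum reads from standard input it prints "-" in place of a file name, so the two output lines can be compared as plain strings with no sed and no --check. The function name same_hash is made up for this example, and GNU coreutils sha256sum is assumed to be installed.

```shell
# Hypothetical helper: exit 0 if both files have the same SHA-256,
# 1 if the hashes differ, 2 if either file could not be read.
same_hash() {
    # Reading from stdin makes sha256sum print "<hash>  -" for both files.
    hash_a=$(sha256sum < "$1") || return 2
    hash_b=$(sha256sum < "$2") || return 2
    [ "$hash_a" = "$hash_b" ]
}
```

Usage would look like: same_hash file1.txt file2.txt && echo true || echo false. This still reads both files completely, so it does not change the point about cmp.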

Upvotes: 1

the Hutt

Reputation: 18438

If you are looking for a more customizable diff, git diff can be used:

if git diff --no-index --quiet old.txt new.txt; then
  echo "file contents are identical"
else
  echo "files differ"
fi

--quiet

Disable all output of the program. Implies --exit-code.

--exit-code

Make the program exit with codes similar to diff(1). That is, it exits with 1 if there were differences and 0 means no differences.

Also, there are various algorithms and settings to choose from:

--diff-algorithm={patience|minimal|histogram|myers}

Choose a diff algorithm. The variants are as follows:

default, myers The basic greedy diff algorithm. Currently, this is the default.

minimal Spend extra time to make sure the smallest possible diff is produced.

patience Use "patience diff" algorithm when generating patches.

histogram This algorithm extends the patience algorithm to "support low-occurrence common elements".

Upvotes: 0

Jack Simth

Reputation: 151

Doing some testing with a Raspberry Pi 3B+ (I'm using an overlay file system, and need to sync periodically), I ran a comparison of my own for diff -q and cmp -s; note that this is a log from inside /dev/shm, so disk access speeds are a non-issue:

[root@mypi shm]# dd if=/dev/urandom of=test.file bs=1M count=100 ; time diff -q test.file test.copy && echo diff true || echo diff false ; time cmp -s test.file test.copy && echo cmp true || echo cmp false ; cp -a test.file test.copy ; time diff -q test.file test.copy && echo diff true || echo diff false; time cmp -s test.file test.copy && echo cmp true || echo cmp false
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 6.2564 s, 16.8 MB/s
Files test.file and test.copy differ

real    0m0.008s
user    0m0.008s
sys     0m0.000s
diff false

real    0m0.009s
user    0m0.007s
sys     0m0.001s
cmp false
cp: overwrite ‘test.copy’? y

real    0m0.966s
user    0m0.447s
sys     0m0.518s
diff true

real    0m0.785s
user    0m0.211s
sys     0m0.573s
cmp true
[root@mypi shm]# pico /root/rwbscripts/utils/squish.sh

I ran it a couple of times. cmp -s consistently had slightly shorter times on the test box I was using. So if you want to use cmp -s to do things between two files....

identical (){
  echo "$1" and "$2" are the same.
  echo This is a function, you can put whatever you want in here.
}
different () {
  echo "$1" and "$2" are different.
  echo This is a function, you can put whatever you want in here, too.
}
cmp -s "$FILEA" "$FILEB" && identical "$FILEA" "$FILEB" || different "$FILEA" "$FILEB"

Upvotes: 2

Nono Taps

Reputation: 163

You can also try the cksum command:

chk1=$(cksum file1 | awk '{print $1}')
chk2=$(cksum file2 | awk '{print $1}')

if [ "$chk1" -eq "$chk2" ]
then
  echo "Files are identical"
else
  echo "Files are not identical"
fi

The cksum command prints a CRC checksum followed by the byte count of the file; the code above compares only the checksum. See 'man cksum'.

Upvotes: 3

jim mcnamara

Reputation: 16399

To confirm that two files are identical, any method must have read both files in full at some point, even if that read happened in the past.

There is no alternative. Creating hashes or checksums also requires reading the whole file at some point. Big files take time.

File metadata retrieval is much faster than reading a large file.

So, is there any file metadata you can use to establish that the files are different? File size? Or even the output of the file command, which reads only a small portion of the file?

File size example code fragment:

  ls -l "$1" "$2" |
  awk 'NR==1{a=$5} NR==2{b=$5}
       END{val=(a==b)?0:1; exit(val)}'

[ $? -eq 0 ] && echo 'same' || echo 'different'

If the files are the same size then you are stuck with full file reads.
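
Putting the two steps together, here is a minimal sketch: check the size first, and only fall back to a full comparison when the sizes match. The function name same_size_then_cmp is made up for illustration. wc -c is used because it is portable; whether it avoids reading the data depends on the implementation, and stat would be the metadata-only alternative, though its flags differ between GNU and BSD.

```shell
# Hypothetical helper: exit 0 if identical, 1 if different, 2 on error.
same_size_then_cmp() {
    size_a=$(wc -c < "$1") || return 2
    size_b=$(wc -c < "$2") || return 2
    # Different sizes mean different contents; no need to compare bytes.
    [ "$size_a" -eq "$size_b" ] || return 1
    # Same size: a byte-by-byte comparison is unavoidable.
    cmp -s -- "$1" "$2"
}
```

As far as I know, GNU cmp -s already performs a similar size check on regular files, so on Linux the gain from doing it by hand may be small.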

Upvotes: 3

rafapc2

Reputation: 466

You can compare using a checksum algorithm such as SHA-256:

sha256sum oldFile > oldFile.sha256

echo "$(cut -d ' ' -f 1 oldFile.sha256)  newFile" | sha256sum --check

If the files match, this prints:

newFile: OK

If the files differ, the result will be:

newFile: FAILED
sha256sum: WARNING: 1 computed checksum did NOT match

Upvotes: 9

Gregory Martin

Reputation: 599

Because I suck and don't have enough reputation points I can't add this tidbit in as a comment.

But, if you are going to use the cmp command (and don't need/want to be verbose) you can just grab the exit status. Per the cmp man page:

If a FILE is '-' or missing, read standard input. Exit status is 0 if inputs are the same, 1 if different, 2 if trouble.

So, you could do something like:

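STATUS="$(cmp --silent "$FILE1" "$FILE2"; echo $?)"  # "$?" gives exit status for each comparison

if [[ $STATUS -ne 0 ]]; then  # non-zero status: files differ (1) or cmp hit an error (2)
    echo "$FILE1 and $FILE2 differ"     # replace with the command to run on $FILE1
else
    echo "$FILE1 and $FILE2 are identical"  # replace with something else
fi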

EDIT: Thanks for the comments everyone! I updated the test syntax here. However, I would suggest you use Vasili's answer if you are looking for something similar to this answer in readability, style, and syntax.
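
Since the man page lists three exit statuses, here is a minimal sketch that treats "different" (1) and "trouble" (2) separately instead of lumping them together as non-zero. The function name describe_cmp is made up for this example; -s is the short, POSIX form of --silent.

```shell
# Hypothetical helper: prints "identical", "different" or "error"
# and returns cmp's own exit status (0, 1 or 2).
describe_cmp() {
    status=0
    cmp -s -- "$1" "$2" || status=$?
    case $status in
        0) echo "identical" ;;
        1) echo "different" ;;
        *) echo "error" ;;      # e.g. a missing or unreadable file
    esac
    return "$status"
}
```

This matters when a missing file should not be silently reported as "files differ".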

Upvotes: 23

pn1 dude

Reputation: 4420

Like @Alex Howansky, I have used 'cmp --silent' for this. But I need both a positive and a negative response, so I use:

cmp --silent file1 file2 && echo '### SUCCESS: Files Are Identical! ###' || echo '### WARNING: Files Are Different! ###'

I can then run this in the terminal or over ssh to check files against a constant file.

Upvotes: 90

Alex Howansky

Reputation: 53646

I believe cmp will stop at the first byte difference:

cmp --silent $old $new || echo "files are different"

Upvotes: 600

VasiliNovikov

Reputation: 10296

To quickly and safely compare any two files:

if cmp --silent -- "$FILE1" "$FILE2"; then
  echo "file contents are identical"
else
  echo "files differ"
fi

It's readable, efficient, and works for any file names including "` $()
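
Since the question is about doing this for a lot of files, here is a sketch of the same test inside a loop. The function name changed_files and the two-directory layout (files with matching names in each directory) are assumptions made for illustration, not something stated in the question.

```shell
# Hypothetical helper: print every regular file in new_dir whose
# same-named counterpart in dst_dir is missing or has different contents.
changed_files() {
    new_dir=$1
    dst_dir=$2
    for new in "$new_dir"/*; do
        [ -f "$new" ] || continue       # skip directories and an unmatched glob
        dst="$dst_dir/${new##*/}"       # same file name in the other directory
        if ! cmp -s -- "$new" "$dst" 2>/dev/null; then
            printf '%s\n' "$new"
        fi
    done
}
```

For example, changed_files ./incoming ./current would list the files worth copying. A missing counterpart makes cmp fail, so it is reported the same way as a difference.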

Upvotes: 56
