ojblass

Reputation: 21620

What is the fastest way to check if files are identical?

If you have 1,000,000 source files, you suspect they are all the same, and you want to compare them, what is the current fastest method to compare those files? Assume they are Java files and the platform where the comparison is done is not important. cksum is making me cry. When I say identical I mean ALL identical.

Update: I know about generating checksums. diff is laughable ... I want speed.

Update: Don't get stuck on the fact they are source files. Pretend for example you took a million runs of a program with very regulated output. You want to prove all 1,000,000 versions of the output are the same.

Update: Read the number of blocks rather than bytes? Immediately throw out those that differ? Is that faster than finding the number of bytes?

Update: Is this ANY different than the fastest way to compare two files?

Upvotes: 40

Views: 49391

Answers (19)

Thomas Padron-McCarthy

Reputation: 27632

There are a number of programs that compare a set of files in general to find identical ones. FDUPES is a good one: Link. A million files shouldn't be a problem, depending on the exact nature of the input. I think that FDUPES requires Linux, but there are other such programs for other platforms.

I tried to write a faster program myself, but except for special cases, FDUPES was faster.

Anyway, the general idea is to start by checking the sizes of the files. Files that have different sizes can't be equal, so you only need to look at groups of files with the same size. Then it gets more complicated if you want optimal performance: If the files are likely to be different, you should compare small parts of the files, in the hope of finding differences early, so you don't have to read the rest of them. If the files are likely to be identical, though, it can be faster to read through each file to calculate a checksum, because then you can read sequentially from the disk instead of jumping back and forth between two or more files. (This assumes normal disks, so SSDs may be different.)
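
For illustration, a minimal sketch in C of that first size-based filtering step might look like this (my own sketch, not FDUPES code; error handling is kept to a bare minimum):

#include <stdio.h>
#include <sys/stat.h>

/* Report which of the given paths share the size of the first path.
   Only files that pass this filter need a content comparison. */
int main(int argc, char **argv)
{
    if (argc < 3) {
        fprintf(stderr, "usage: %s file1 file2 [file3 ...]\n", argv[0]);
        return 1;
    }

    struct stat ref;
    if (stat(argv[1], &ref) != 0) {
        perror(argv[1]);
        return 1;
    }

    for (int i = 2; i < argc; i++) {
        struct stat st;
        if (stat(argv[i], &st) != 0) {
            perror(argv[i]);
            continue;
        }
        if (st.st_size != ref.st_size)
            printf("%s: different size, cannot be identical\n", argv[i]);
        else
            printf("%s: same size, needs content comparison\n", argv[i]);
    }
    return 0;
}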

In my benchmarks when trying to make a faster program, it (somewhat to my surprise) turned out to be faster to first read through each file to calculate a checksum, and then, if the checksums were equal, compare the files directly by reading blocks alternately from each file, than to just read blocks alternately without the previous checksum calculations! It turned out that when calculating the checksums, Linux cached both files in main memory, reading each file sequentially, and the second reads were then very fast. When starting with alternating reads, the files were not (physically) read sequentially.

EDIT:

Some people have expressed surprise and even doubt that it could be faster to read the files twice than to read them just once. Perhaps I didn't manage to explain very clearly what I was doing. I am talking about cache pre-loading, in order to have the files in the disk cache when later accessing them in a way that would be slow to do on the physical disk drive. Here is a web page where I have tried to explain in more detail, with pictures, C code and measurements.
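
For what it's worth, a minimal sketch of that kind of cache pre-loading on a POSIX/Linux system could look like the code below (my own illustration, not the code from the page mentioned above): read each file sequentially once, optionally hinting the kernel, before doing the alternating comparison.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Read a file once, sequentially, so its pages end up in the page cache.
   A later random-access comparison then hits memory instead of the disk. */
static void preload(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) {
        perror(path);
        return;
    }
    /* Hint that we will soon need the whole file (POSIX). */
    posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED);

    char buf[1 << 16];
    while (read(fd, buf, sizeof buf) > 0)
        ;   /* discard the data; we only want it cached */
    close(fd);
}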

However, this has (at best) marginal relevance to the original question.

Upvotes: 4

Peter Johnson

Reputation: 1

This problem depends heavily on the average file size, but a simple computational trade-off can be made.

Checking file size

The easiest thing to check is the file size. We can compute this on a POSIX-compliant machine with a C standard library installed.

We can get the file info from the filename with the stat function.

#include <stdio.h>
#include <sys/stat.h>
// ...
struct stat st;
if (stat(filename, &st) != 0) {
    perror("stat");   // file missing, permission problem, etc.
}

Then just access the file size with st_size.

printf("%d\n", st.st_size);

Optional: Checking File Permissions

If you need to, you can check the file permissions with st_mode.

st.st_mode

Size Problem

Depending on the size of the file, it may be better to actually compare without a checksum. For medium/large files, you can use a speedy CRC implementation (this is just one I typed up):

#include <stddef.h>
#include <stdint.h>

uint32_t crc32c(uint32_t crc, const unsigned char *buf, size_t len) {
    int q;

    crc = ~crc;
    while (len--) {
        crc ^= *buf++;
        for (q = 0; q < 8; q++)
            crc = crc & 1 ? (crc >> 1) ^ 0x82f63b78 : crc >> 1; // CRC iSCSI
    }
    return ~crc;
}

This CRC implementation is relatively standard. Depending on the file size, the hash function may also be different. This implementation does NOT contain a lookup table, which you would want.

Note: For x86 systems, crc = (crc >> 1) ^ (0x82f63b78 & (0 - (crc & 1))) is slightly faster. Both are sound.

For a much faster tableless algorithm (credits to Hagai Gold and Stephan Brumme):

uint32_t crc32_1byte_tableless(const void* data, size_t length, uint32_t previousCrc32)
{
  uint32_t crc = ~previousCrc32; // same as previousCrc32 ^ 0xFFFFFFFF
  const uint8_t* current = (const uint8_t*) data;
  while (length-- != 0)
  {
    uint8_t s = (uint8_t)crc ^ *current++;
    // Hagai Gold made me aware of this table-less algorithm and sent me code
    // polynomial 0xEDB88320 can be written in binary as 11101101101110001000001100100000b
    // reverse the bits (or just assume bit 0 is the first one)
    // and we have bits set at position 0, 1, 2, 4, 5, 7, 8, 10, 11, 12, 16, 22, 23, 26
    // => those are the shift offsets:
    //crc = (crc >> 8) ^
    //       t ^
    //      (t >>  1) ^ (t >>  2) ^ (t >>  4) ^ (t >>  5) ^  // == y
    //      (t >>  7) ^ (t >>  8) ^ (t >> 10) ^ (t >> 11) ^  // == y >> 6
    //      (t >> 12) ^ (t >> 16) ^                          // == z
    //      (t >> 22) ^ (t >> 26) ^                          // == z >> 10
    //      (t >> 23);
    // the fastest I can come up with:
    uint32_t low = (s ^ (s << 6)) & 0xFF;
    uint32_t a   = (low * ((1 << 23) + (1 << 14) + (1 << 2)));
    crc = (crc >> 8) ^
          (low * ((1 << 24) + (1 << 16) + (1 << 8))) ^
           a ^
          (a >> 1) ^
          (low * ((1 << 20) + (1 << 12)           )) ^
          (low << 19) ^
          (low << 17) ^
          (low >>  2);
    // Hagai's code:
    /*uint32_t t = (s ^ (s << 6)) << 24;
    // some temporaries to optimize XOR
    uint32_t x = (t >> 1) ^ (t >> 2);
    uint32_t y = x ^ (x >> 3);
    uint32_t z = (t >> 12) ^ (t >> 16);
    crc = (crc >> 8) ^
           t ^ (t >> 23) ^
           y ^ (y >>  6) ^
           z ^ (z >> 10);*/
  }
  return ~crc; // same as crc ^ 0xFFFFFFFF
}

Technically you can create a lookup table that is hilariously large (up to 4 GB) on most flash memory chips; it's a logarithmic trade-off.

For extremely large files, i.e. terabytes, it may be beneficial to use xxHash.

Raw computation

If the average file size is (heuristically computed to be) below 52 bytes, you may benefit from a manual byte-by-byte comparison. I will not provide comparison code in C, as this post is already relatively long.

Conclusion

The two (or three) step process is as follows (a rough sketch follows the list):

  • Compare file sizes
  • Hash each file and compare against the first file's hash
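
As an outline of that pipeline for a single pair of files, reusing the crc32c function above (file_size and file_crc32c are hypothetical helper names I am introducing here; this is a sketch, not a tuned implementation):

#include <stdint.h>
#include <stdio.h>
#include <sys/stat.h>

/* Defined above in this answer. */
uint32_t crc32c(uint32_t crc, const unsigned char *buf, size_t len);

/* Hypothetical helper: -1 on error, otherwise the file size in bytes. */
static long long file_size(const char *path) {
    struct stat st;
    return stat(path, &st) == 0 ? (long long) st.st_size : -1;
}

/* Hypothetical helper: CRC-32C of a whole file, streamed block by block. */
static uint32_t file_crc32c(const char *path) {
    unsigned char buf[1 << 16];
    uint32_t crc = 0;
    size_t n;
    FILE *f = fopen(path, "rb");
    if (!f) return 0;
    while ((n = fread(buf, 1, sizeof buf, f)) > 0)
        crc = crc32c(crc, buf, n);
    fclose(f);
    return crc;
}

/* Sizes match and CRCs match => probably identical; a byte-by-byte
   comparison is still needed if you want actual proof. */
static int probably_identical(const char *a, const char *b) {
    long long sa = file_size(a), sb = file_size(b);
    if (sa < 0 || sb < 0 || sa != sb) return 0;
    return file_crc32c(a) == file_crc32c(b);
}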

Upvotes: 0

deadcow

Reputation: 1

I would first create a database table with columns for the pathname and the SHA-1 of the file contents, hash all the files, and store the pathname and SHA-1 of each. Then, whenever a new file is stored, compute its SHA-1 and check whether that SHA-1 already exists in the database. If it does, log that the file already existed, along with its pathname, and do whatever you want with it (create a symlink, say). If files are uploaded, implement this check as part of your validation.

Upvotes: 0

md27

Reputation: 95

If you want to compare files one by one, use ExamDiff.

Upvotes: -1

mikeserv

Reputation: 694

In my opinion, this is a file-system operation. So first, choose your filesystem with care. Next, deduplicate. Then compare inodes. Like:

% find / -inum "$(ls -di "./test.file" | grep -Eo '^[0-9]+')"
<list of identical files provided in a few seconds to a minute>

Upvotes: 0

janetsmith

Reputation: 8722

Use the concept of a Bloom filter. A simple explanation is here: http://crzyjcky.com/2013/01/03/the-magical-bloom-filter/

It gives you constant-time comparison. However, this method cannot be used alone. Apache Cassandra and HBase use this technique internally.

It basically tells you, very quickly, that the files are not identical. If it says the files may be identical, you have to do another round of checking with a reliable method.
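
For illustration only, a toy Bloom filter over 64-bit file digests could look like the sketch below (my own code; the filter size and the two index functions are arbitrary assumptions, and any "maybe" answer still needs a real comparison):

#include <stdint.h>

#define BLOOM_BITS (1u << 20)            /* 1 Mbit filter, arbitrary size */
static uint8_t bloom[BLOOM_BITS / 8];

/* Two cheap index functions derived from a 64-bit per-file digest. */
static uint32_t idx1(uint64_t h) { return (uint32_t)(h         % BLOOM_BITS); }
static uint32_t idx2(uint64_t h) { return (uint32_t)((h >> 32) % BLOOM_BITS); }

static void bloom_add(uint64_t h) {
    bloom[idx1(h) / 8] |= 1u << (idx1(h) % 8);
    bloom[idx2(h) / 8] |= 1u << (idx2(h) % 8);
}

/* 0 means: definitely not seen before (file differs from everything added).
   1 means: possibly seen before; verify with a reliable comparison. */
static int bloom_maybe_contains(uint64_t h) {
    return ((bloom[idx1(h) / 8] >> (idx1(h) % 8)) & 1) &&
           ((bloom[idx2(h) / 8] >> (idx2(h) % 8)) & 1);
}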

Upvotes: 1

Ryan

Reputation: 1

I have just written a c# app that does something similar to what you want. What my code does is this.

Read all of the sizes of each file into a list or array.

Use a for loop to check if any of these sizes are the same. If two files are the same size, compare a byte of one file to a byte of the other file. If the two bytes are the same, move on to the next byte. If a difference is found, return that the files are different.

If the end of both files is reached and the final two bytes are the same, the files must be identical.

I have experimented with comparing MD5 hashes of files rather than going through byte by byte, and I have found that identical files are often missed with this method; however, it is significantly faster.

Upvotes: 0

David Z

Reputation: 131550

I'd opt for something like the approach taken by the cmp program: open two files (say file 1 and file 2), read a block from each, and compare them byte-by-byte. If they match, read the next block from each, compare them byte-by-byte, etc. If you get to the end of both files without detecting any differences, seek to the beginning of file 1, close file 2 and open file 3 in its place, and repeat until you've checked all files. I don't think there's any way to avoid reading all bytes of all files if they are in fact all identical, but I think this approach is (or is close to) the fastest way to detect any difference that may exist.

OP Modification: Lifted up important comment from Mark Bessey

"another obvious optimization if the files are expected to be mostly identical, and if they're relatively small, is to keep one of the files entirely in memory. That cuts way down on thrashing trying to read two files at once."

Upvotes: 28

Doug Bennett

Reputation: 175

Most people in their responses are ignoring the fact that the files must be compared repeatedly. Thus the checksums are faster as the checksum is calculated once and stored in memory (instead of reading the files sequentially n times).

Upvotes: 13

Michael Burr

Reputation: 340168

Assuming that the expectation is that the files will be the same (it sounds like that's the scenario), then dealing with checksums/hashes is a waste of time - it's likely that they'll be the same and you'd have to re-read the files to get the final proof (I'm also assuming that since you want to "prove ... they are the same", having them hash to the same value is not good enough).

If that's the case I think that the solution proposed by David is pretty close to what you'd need to do. A couple things that could be done to optimize the comparison, in increasing level of complexity:

  • check if the file sizes are the same before doing the compare
  • use the fastest memcmp() that you can (comparing words instead of bytes - most C runtimes should do this already)
  • use multiple threads to do the memory block compares (up to the number of processors available on the system; going over that would cause your threads to fight each other)
  • use overlapped/asynchronous I/O to keep the I/O channels as busy as possible, but also profile carefully so you thrash between the files as little as possible (if the files are divided among several different disks and I/O ports, all the better)

Upvotes: 9

mark

Reputation:

Update: Don't get stuck on the fact they are source files. Pretend for example you took a million runs of a program with very regulated output. You want to prove all 1,000,000 versions of the output are the same.

If you have control over the output, have the program creating the files/output create an MD5 on the fly and embed it in the file or output stream, or even pipe the output through a program that creates the MD5 along the way and stores it alongside the data somehow. The point is to do the calculations while the bytes are already in memory.
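
As an illustration of hashing while the bytes are still in memory, here is a hedged sketch assuming OpenSSL's EVP API is available (write_and_hash is a hypothetical helper name):

#include <openssl/evp.h>
#include <stdio.h>

/* Write a buffer to the output file and fold it into a running MD5 at the
   same time, so the hash costs no extra file reads. */
static void write_and_hash(FILE *out, EVP_MD_CTX *md,
                           const unsigned char *buf, size_t len)
{
    fwrite(buf, 1, len, out);
    EVP_DigestUpdate(md, buf, len);
}

/* Usage (error handling omitted):
     EVP_MD_CTX *md = EVP_MD_CTX_new();
     EVP_DigestInit_ex(md, EVP_md5(), NULL);
     ... call write_and_hash() for every block produced ...
     unsigned char digest[EVP_MAX_MD_SIZE];
     unsigned int digest_len;
     EVP_DigestFinal_ex(md, digest, &digest_len);
     EVP_MD_CTX_free(md);
   The digest can then be stored alongside the output file. */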

If you can't pull this off then, as others have said, check file sizes and then do a straight byte-by-byte comparison on same-sized files. I don't see how any sort of binary division or MD5 calculation is any better than a straight comparison; you will have to touch every byte to prove equality any way you cut it, so you might as well cut the amount of computation needed per byte and gain the ability to cut off as soon as you find a mismatch.

The MD5 calculation would be useful if you plan to compare these files again later against new outputs, but then you're basically back to my first point of calculating the MD5 as soon as possible.

Upvotes: 5

bo.

Reputation: 9

Beyond Compare can sync two folders, and it is super fast! We use it all the time, every day.

Upvotes: 0

BeWarned

Reputation: 2338

I don't think hashing is going to be faster than byte-by-byte comparisons. The byte-by-byte comparison can be optimized a bit by pipelining the reading and comparison of the bytes; also, multiple sections of the file could be compared in parallel threads. It would go something like this:

  • Check if the file sizes differ
  • Read blocks of the files into memory asynchronously
  • Hand them off to worker threads to do the comparisons

Or just run cmp (or the equivalent for your OS) on pairs of files in parallel. This could be scripted easily and you still get the benefit of parallelism.

Upvotes: 1

Peter Wone

Reputation: 18739

First compare the file lengths of all million. If you have a cheap way to do so, start with the largest files. If they all pass that, then compare each file using a binary division pattern; this will fail faster on files that are similar but not the same. For information on this method of comparison, see the Knuth-Morris-Pratt method.

Upvotes: 4

NitroxDM

Reputation: 5131

Why reinvent the wheel? How about a third-party app? Granted, it doesn't have APIs, but I don't imagine you put yourself in this situation often. I like the app DoubleKiller; just make a backup before you start. :) It's fast and free!

Upvotes: 0

Blair Zajac

Reputation: 4625

I would run something like this

find . -name '*.java' -print0 | xargs -0 md5sum | sort

then see which files have different MD5 sums. This will group the files by checksum.

You can replace md5sum with sha1sum or even rmd160 if you like.

Upvotes: 0

Sam Saffron

Reputation: 131092

Well, the optimal algorithm will depend on the number of duplicate files.

Assuming a few are the same but most are different, and that the files are big:

Filter out the ones that are obviously not the same using a simple file length check.

Choose random bytes from each file, calculate a hash, and compare (minimizing disk seeks).

Follow that with a full-file SHA1.
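
A rough sketch of the random-sampling step (my own illustration; it probes the same pseudo-random offsets in both files and compares the sampled bytes directly, which serves the same screening purpose as hashing the samples):

#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Probe both files at identical pseudo-random offsets within `size` bytes.
   Returns 0 as soon as a sampled byte differs, 1 if all samples match
   (which only means "possibly identical"). */
static int samples_match(int fd_a, int fd_b, off_t size, int samples)
{
    if (size <= 0)
        return 1;                     /* empty files: nothing to sample */
    srand(12345);                     /* same offsets for every pair */
    for (int i = 0; i < samples; i++) {
        off_t off = (off_t)((double)rand() / ((double)RAND_MAX + 1) * size);
        unsigned char a, b;
        if (pread(fd_a, &a, 1, off) != 1 || pread(fd_b, &b, 1, off) != 1)
            return 0;                 /* treat read errors as a mismatch */
        if (a != b)
            return 0;
    }
    return 1;
}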

Upvotes: 1

sangupta

Reputation: 2406

An MD5 hash would be faster than a byte-by-byte comparison, but slower than a normal CRC check. You have to figure out what kind of reliability you want in the comparison.

Upvotes: -1

paxdiablo

Reputation: 881093

Using cksum is not as reliable as using something like md5sum. But I would opt for maximum reliability, which means a byte-by-byte comparison using cmp.

You have to read every byte in both files for all checking methods so you may as well opt for the one that is most reliable.

As a first pass, you could check the directory listing to see if the sizes are different. That's a quick way to get faster feedback for different files.

Upvotes: 2
