Reputation: 495
We have a very old, unsupported program which copies files across SMB shares. It has a checksum algorithm to determine if the file contents have changed before copying. The algorithm seems easily fooled -- we've just found an example where two files, identical except a single '1' changing to a '2', return the same checksum. Here's the algorithm:
unsigned long GetFileCheckSum(CString PathFilename)
{
    FILE* File;
    unsigned long CheckSum = 0;
    unsigned long Data = 0;
    unsigned long Count = 0;

    if ((File = fopen(PathFilename, "rb")) != NULL)
    {
        while (fread(&Data, 1, sizeof(unsigned long), File) != FALSE)
        {
            CheckSum ^= Data + ++Count;
            Data = 0;
        }
        fclose(File);
    }
    return CheckSum;
}
I'm not much of a programmer (I am a sysadmin) but I know an XOR-based checksum is going to be pretty crude. What're the chances of this algorithm returning the same checksum for two files of the same size with different contents? (I'm not expecting an exact answer, "remote" or "quite likely" is fine.)
How could it be improved without a huge performance hit?
Lastly, what's going on with the fread()? I had a quick scan of the documentation but I couldn't figure it out. Is Data being set to each byte of the file in turn?

Edit: okay, so it's reading the file in unsigned long chunks (let's assume a 32-bit OS here). What does each chunk contain? If the contents of the file are abcd, what is the value of Data on the first pass? Is it (in Perl):

(ord('a') << 24) & (ord('b') << 16) & (ord('c') << 8) & ord('d')
Upvotes: 5
Views: 4682
Reputation: 15788
SHA-1 and (more recently) SHA-2 provide excellent hash functions, and I believe they are slowly supplanting MD5 due to better hashing properties. All of them (MD2, SHA-1, etc.) have efficient implementations and return a hash of a buffer that is several characters long (though always a fixed length), and are provably more reliable than reducing a hash to an integer. If I had my druthers, I'd use SHA-2. Follow this link for libraries that implement SHA checksums.
If you don't want to compile in those libraries, Linux (and probably Cygwin) has the following executables: md5sum, sha1sum, sha224sum, sha256sum, sha384sum, sha512sum. You can feed any of them your file and they will print the checksum as a hex string. You can use popen to execute those programs -- with something like this:
const int maxBuf = 1024;
char buf[maxBuf];
FILE* f = popen( "sha224sum myfile", "r" );   /* "r": we read the command's output */
size_t bytesRead = fread( buf, 1, maxBuf, f );
pclose( f );                                  /* popen streams are closed with pclose, not fclose */
Obviously this will run quite a lot slower, but it makes for a useful first pass. If speed is an issue, note that file hashing operations like this are I/O bound (memory and disk access will be your bottlenecks), so I'd expect all of these algorithms to run about as fast as one that produces an unsigned int. Perl and Python also come with implementations of MD5, SHA-1 and SHA-2, and will probably run as fast as C/C++.
Upvotes: 0
Reputation: 180787
MD5 is commonly used to verify the integrity of transferred files. Source code is readily available in C++. It is widely considered to be a fast and accurate algorithm.
See also Robust and fast checksum algorithm?
Upvotes: 7
Reputation: 36402
I'd suggest you take a look at Fletcher's checksum, specifically fletcher-32, which ought to be fairly fast, and detect various things the current XOR chain would not.
Upvotes: 5
Reputation: 49719
You could easily improve the algorithm by using a formula like this one:
Checksum = (Checksum * a + Data * b) + c;
If a, b and c are large primes, this should return good results. After this, rotating (not shifting!) the bits of checksum will further improve it a bit.
Using primes, this is a similar algorithm to that used for Linear congruential generators - it guarantees long periods and good distribution.
Upvotes: 3
Reputation: 43110
{
CheckSum ^= Data + ++Count;
Data = 0;
}
I don't think "++Count" does much work. The code is little better than
{
CheckSum ^= Data;
}
XORing a sequence of bytes is not enough, especially with text files.
I suggest using a hash function.
Upvotes: 0
Reputation: 29240
It seems like your algorithm makes no effort to deal with files that are not an exact multiple of 4 bytes in size. The return value of fread is not a boolean but the number of bytes actually read, which will differ from 4 on EOF or if an error occurred. You check for neither, but simply assume that if it didn't return 0, you have 4 valid bytes in 'Data' with which to calculate your hash.
If you really want to use a hash, I'd recommend several things. First, use a simple cryptographic hash like MD5, not CRC32. CRC32 is decent for checking data validity, but for spanning a file system and ensuring no collisions, it's not as great a tool because of the birthday paradox mentioned in the comments elsewhere. Second, don't write the function yourself; find an existing implementation. Finally, consider simply using rsync to replicate files instead of rolling your own solution.
Upvotes: 0
Reputation: 269667
Even "expensive" cryptographic hash functions usually require multiple iterations to take significant amounts of time. Although no longer recommended for cryptographic purposes, where users would deliberately try to create collisions, functions like SHA1 and MD5 are widely available and suitable for this purpose.
If a smaller hash value is needed, CRC is alright, but not great. An n-bit CRC will fail to detect a small fraction of changes that span more than n bits. For example, suppose just a single dollar amount in a file is changed, from $12,345 to $34,567. A 32-bit CRC might miss that change.
Truncating the result of a longer cryptographic hash will detect changes more reliably than a CRC.
Upvotes: 0
Reputation: 78536
The fread bit is reading the file one chunk at a time. Each chunk is the size of a long (in C this is not a well-defined size, but you can assume 32 or 64 bits). Depending on how it gets buffered, this might not be too bad. OTOH, reading a larger chunk into an array and looping over it might be a lot faster.
Upvotes: 0