Reputation: 27952
I'm looking for an efficient way to tell whether or not a string (or a file) has changed since the last time we looked at it.
So, we run this function against 1,000,000 files/strings (each file/string is less than 1000 bytes), and store the output for each file/string.
I'll then wait a few days and run this again. I need to find out whether or not each file has changed...
Should I calculate CRC32s for each file? MD5? Something else more efficient?
Is CRC32 good enough for telling me whether or not a file/string has changed?
EDIT It has to work for both files and strings, so timestamps on the files are out of the question.
Upvotes: 3
Views: 1931
Reputation: 696
For completeness: CRC32 and MD5 may tell you a string has not changed when, in fact, it has (because distinct strings can share the same CRC32 or MD5).
Upvotes: 0
Reputation: 11628
You said the data would be around one million 1kB strings/files and you want to check them every few days. If so, you really don't have to worry about performance, because processing 1GB of data won't take that long, whether you use crc32 or md5.
I suggest using md5, because it's less likely to collide than crc32. Crc32 will do the job, but you can get a better result without investing much more.
Edit: As someone else stated, comparing the strings to a backup copy is faster, because you can abort as soon as two chars differ. This is not 100% true if you have to read the string out of a file. If we assume the strings come out of files and you use md5, you'll have to read 32 bytes (the hex-encoded digest) plus the average string length for every string you want to compare. When you compare byte by byte, you'll have to read a minimum of 2 bytes and a maximum of twice the string length. So if many of your strings share a long common prefix (more leading chars equal than 32 bytes plus the average string length), you'll be faster with a hash. (Correct me if I'm wrong.) Because this is a theoretical case, you'll be fine sticking with a char-by-char comparison. If the average string length is bigger than 32 bytes, you'll also save disk space by using a hash ;-).
But as I already stated above: performance won't be your problem when dealing with that amount of data.
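A minimal sketch of the md5 approach in Java, using the standard MessageDigest API (class and method names here are illustrative, not from the question):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Md5Fingerprint {
    // Returns a hex-encoded MD5 digest of the input; store this string and
    // compare it against a freshly computed digest on the next pass.
    static String md5Hex(byte[] data) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] digest = md.digest(data); // 16 raw bytes
        StringBuilder sb = new StringBuilder(32);
        for (byte b : digest) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        String before = md5Hex("hello".getBytes(StandardCharsets.UTF_8));
        String after  = md5Hex("hello!".getBytes(StandardCharsets.UTF_8));
        System.out.println(before.equals(after) ? "unchanged" : "changed");
    }
}
```

Storing one 32-character digest per file keeps the lookup table small even for a million entries.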
Upvotes: 1
Reputation: 35925
String comparison will be more efficient than either crc32 or md5, or any other hash algorithm proposed.
For starters you can bail out of a string comparison as soon as the two strings are different, whereas with a hashing algorithm you have to hash the entire contents of the file before you can make a comparison.
What is more, hashing algorithms have extra work to do to generate the hash, whereas a string comparison is a simple equality check between two values.
I'd imagine a string-based comparison of the files/strings that short-circuits on the first difference (per file/string) will give you good performance.
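The short-circuit comparison described above can be sketched like this in Java (assuming both the current and previously stored contents are available as byte arrays; the names are illustrative):

```java
public class ByteCompare {
    // Returns true as soon as a length mismatch or a differing byte is found,
    // without examining the rest of either array.
    static boolean hasChanged(byte[] current, byte[] previous) {
        if (current.length != previous.length) {
            return true;
        }
        for (int i = 0; i < current.length; i++) {
            if (current[i] != previous[i]) {
                return true; // bail out on the first difference
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(hasChanged(new byte[] {1, 2, 3}, new byte[] {1, 2, 4}));
    }
}
```

Note this requires keeping a full copy of the previous contents around, whereas a hash only needs the stored digest.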
Upvotes: 1
Reputation: 40669
For the files you could use the timestamp.
For the strings, you could keep a backup copy.
Just comparing them and re-writing the backup might be as fast as CRC or MD5.
Upvotes: 1
Reputation: 28837
CRC32, or CRC64 will do the job just fine.
You might even be able to use it as a basis for some sort of hash lookup.
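For reference, Java ships a CRC32 implementation in java.util.zip; a minimal sketch of using it as a change fingerprint (class and method names are illustrative):

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class Crc32Fingerprint {
    // Computes the CRC32 checksum of the input as a long; store the value
    // and compare it against a recomputed checksum on the next pass.
    static long crc32(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data);
        return crc.getValue();
    }

    public static void main(String[] args) {
        long before = crc32("hello".getBytes(StandardCharsets.UTF_8));
        long after  = crc32("hello!".getBytes(StandardCharsets.UTF_8));
        System.out.println(before != after ? "changed" : "unchanged");
    }
}
```

The checksum fits in a single long per file, which also makes it convenient as a hash-table key for lookups.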
Upvotes: 1
Reputation: 1329
I use MD5 for this type of thing, seems to work well enough. If you're using .NET, see System.Security.Cryptography.MD5CryptoServiceProvider.
Upvotes: 0
Reputation:
In Java you can do:
File file = new File(filePath);
long lastModified = file.lastModified(); // milliseconds since the epoch, or 0L if the file does not exist
Upvotes: 0
Reputation: 42627
For files, do you have to look at the content? The filesystem will track a modified timestamp.
Upvotes: 1