RonaDona

Reputation: 932

Finding file duplicates - C# preferably

I am trying to find all duplicates of a given unique file on a file server. Here is what I have done:

  1. Get the unique file's hash code.
  2. Compare the unique file's hash code to every file's hash code on the file server. If equal, add to a list of duplicates.

This does the job but takes forever (I have 200k files on the file server), so I had to think of something else, and this is what I did:

  1. Get the unique file's hash code.
  2. Get the unique file's size in bytes.
  3. Get a list of all files that have the same file size (this is very fast as I do not need to read the files)
  4. Compare the unique file's hash code to the hash code of each shortlisted file.

This has reduced the time required from a few hours to 10 minutes, but it is still not good enough, especially when trying to find duplicates for a batch of files. Each search taking 10 minutes means 100 files would take roughly 16 hours!
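For reference, this is roughly the shape of the current approach in C# (a minimal sketch; SHA-256, the method names, and the search root are placeholder choices, not necessarily what my actual code uses):

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Linq;
    using System.Security.Cryptography;

    class DuplicateFinder
    {
        // Hash of the full file contents; this is the expensive step.
        static byte[] ComputeHash(string path)
        {
            using (var sha = SHA256.Create())
            using (var stream = File.OpenRead(path))
                return sha.ComputeHash(stream);
        }

        // Shortlist by size (cheap), then hash only the candidates (expensive).
        static List<string> FindDuplicates(string targetFile, string searchRoot)
        {
            long targetSize = new FileInfo(targetFile).Length;
            byte[] targetHash = ComputeHash(targetFile);

            return Directory.EnumerateFiles(searchRoot, "*", SearchOption.AllDirectories)
                .Where(f => new FileInfo(f).Length == targetSize)
                .Where(f => ComputeHash(f).SequenceEqual(targetHash))
                .ToList();
        }
    }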

Is there a unique file identifier better than the hash code? Computing the hash is the time-consuming part of the process.

Thank you,

Upvotes: 4

Views: 923

Answers (2)

MarkO

Reputation: 2233

Searching for duplicates by hash code is definitely the slowest way possible; it involves a lot of disk I/O and CPU processing.

I have some experience in this field, and we found the fastest approach was to eliminate files as soon as possible:

  1. Short list by grouping on file size
  2. Compare the first 256 bytes of all those files per group, byte by byte. This should eliminate a lot of files.
  3. Loop on step 2, but double the buffer size on each iteration (with a max of 16K or so)

Opening/closing all those file handles in a loop is indeed a slight overhead, but not as much as completely reading all the files.
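Here is a rough C# sketch of the early-exit idea, applied to a pair of same-size files (the chunk sizes, names, and the pairwise formulation are illustrative choices; the elimination described above works group-wise rather than pairwise):

    using System;
    using System.IO;

    static class ProgressiveCompare
    {
        // Reads until the buffer is full or the stream ends.
        static int ReadFully(Stream s, byte[] buffer)
        {
            int total = 0;
            while (total < buffer.Length)
            {
                int n = s.Read(buffer, total, buffer.Length - total);
                if (n == 0) break;
                total += n;
            }
            return total;
        }

        // Compares two same-size files in growing chunks (256 B, 512 B, ...
        // capped at maxChunk) and bails out on the first differing byte.
        public static bool ContentsMatch(string pathA, string pathB, int maxChunk = 16 * 1024)
        {
            using (var a = File.OpenRead(pathA))
            using (var b = File.OpenRead(pathB))
            {
                for (int chunk = 256; ; chunk = Math.Min(chunk * 2, maxChunk))
                {
                    var bufA = new byte[chunk];
                    var bufB = new byte[chunk];
                    int readA = ReadFully(a, bufA);
                    int readB = ReadFully(b, bufB);
                    if (readA != readB) return false;
                    for (int i = 0; i < readA; i++)
                        if (bufA[i] != bufB[i]) return false;
                    if (readA < chunk) return true; // both streams exhausted
                }
            }
        }
    }

Most non-duplicates differ somewhere in the first few hundred bytes, so in practice the loop rarely gets past the first iteration.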

Upvotes: 4

Barak Itkin

Reputation: 4877

Well, since this question deals with optimizing the running time by constant factors rather than orders of magnitude, we'll have to be a bit more specific about the type of files you are actually dealing with.

You currently have two methods of comparing files - getting their size (fast but inaccurate) and getting their hash ("slow" but accurate enough). The problem is that computing the hashes of files may take a while when file sizes are not negligible.

So, depending on the types of inputs you actually have, you may be able to come up with more comparison operations that sit somewhere between these two (slower than file size, but more accurate). For example:

  • If your inputs are text files, comparing the hashes of their first sentence (or the first X characters) may work very well (assuming they are not all letters that start with "Hello World." or some other common template).
  • If your inputs are image files, comparing their internal metadata (such as time taken / geo-tagging / some other field with not-so-common values) can also give a relatively fast comparison that resolves many potential equalities.
  • If your input files are simply random files on a file-sharing website, reading their first few bytes should distinguish many of them from one another thanks to the file-format headers (or even better - if your users are not doing things like naming files "Hello.jpg" and "Hello.jpeg", then distinguishing the files by their suffix would also be a fast comparison).

In general, if you have many files with similar sizes (which is why you end up doing so much work computing hashes later), then there is a high chance that these files have something in common. Given that you know the types of inputs better than we do right now, try coming up with comparison criteria that don't require you to read the entire file and so should be faster.

Finally, when you have all your comparison criteria, apply them to create "buckets" of inputs (lists of inputs with the same result from the criterion), starting with the fastest criterion and then applying the slower ones inside each bucket that has more than one input.
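A hedged C# sketch of that bucketing strategy, using file size, a small header read, and a full hash as example criteria (these particular criteria and names are chosen for illustration, not prescribed above):

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Linq;
    using System.Security.Cryptography;

    static class BucketedDuplicates
    {
        // Cheapest criterion: file size.
        static string SizeKey(string path) =>
            new FileInfo(path).Length.ToString();

        // Mid-cost criterion: first 64 bytes (format headers, magic numbers, ...).
        static string HeaderKey(string path)
        {
            using (var s = File.OpenRead(path))
            {
                var buf = new byte[64];
                int read = s.Read(buf, 0, buf.Length);
                return Convert.ToBase64String(buf, 0, read);
            }
        }

        // Most expensive criterion: hash of the full contents.
        static string HashKey(string path)
        {
            using (var sha = SHA256.Create())
            using (var s = File.OpenRead(path))
                return Convert.ToBase64String(sha.ComputeHash(s));
        }

        // Returns groups of files that agree on every criterion (likely duplicates).
        public static IEnumerable<List<string>> FindDuplicateGroups(IEnumerable<string> files)
        {
            var criteria = new Func<string, string>[] { SizeKey, HeaderKey, HashKey };

            var buckets = new List<List<string>> { files.ToList() };
            foreach (var criterion in criteria)
            {
                buckets = buckets
                    .Where(b => b.Count > 1)                 // singletons can't be duplicates
                    .SelectMany(b => b.GroupBy(criterion))   // refine with the next criterion
                    .Select(g => g.ToList())
                    .ToList();
            }
            return buckets.Where(b => b.Count > 1);
        }
    }

The point is that the expensive hash is only ever computed for files that already agreed on every cheaper criterion, which is usually a small fraction of the 200k files.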

Upvotes: 2
