Reputation: 932
I am trying to find all duplicates of a given unique file on a file server. Here is what I have done:
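Conceptually it is a brute-force scan: hash every file on the share and compare it against the target file's hash. A simplified sketch of that pattern (in Python; the helper names and paths are only illustrative, not my exact code):

    import hashlib
    import os

    def file_hash(path):
        # Hash the entire file contents; this full read is the expensive part.
        h = hashlib.sha1()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1024 * 1024), b""):
                h.update(chunk)
        return h.hexdigest()

    def find_duplicates(target_path, share_root):
        # Brute force: hash every file under share_root and compare to the target's hash.
        target_hash = file_hash(target_path)
        duplicates = []
        for dirpath, _, filenames in os.walk(share_root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                if path != target_path and file_hash(path) == target_hash:
                    duplicates.append(path)
        return duplicates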
This does the job but takes forever (I have 200k files on the file server), so I had to think of something else, and this is what I did:
This reduced the time required for the task from a few hours to 10 minutes, but that is still not good enough, especially when trying to find duplicates for a bunch of files. Each search taking 10 minutes means 100 files will take about 16 hours!
Is there a unique file identifier better than the hash code? Computing the hash is the time-consuming part of the process.
Thank you,
Upvotes: 4
Views: 923
Reputation: 2233
Searching for duplicates by hash code is definitely the slowest way possible; it involves a lot of disk I/O and CPU processing.
I have some experience in this field, and we found the fastest approach was to eliminate files as soon as possible:
Opening/closing all those file handles in a loop is indeed a slight overhead, but not nearly as much as completely reading all the files.
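Roughly, the kind of early-elimination cascade I mean looks like this (a simplified Python sketch; the prefix size and helper names are my own assumptions, tune them for your data):

    import hashlib
    import os

    PREFIX_BYTES = 4096  # assumed prefix size; adjust for your file types

    def same_size(a, b):
        # Cheapest check: file size comes from metadata, no content is read.
        return os.path.getsize(a) == os.path.getsize(b)

    def same_prefix(a, b, n=PREFIX_BYTES):
        # Read only the first few KB of each file; this is the small
        # open/read/close overhead mentioned above.
        with open(a, "rb") as fa, open(b, "rb") as fb:
            return fa.read(n) == fb.read(n)

    def full_hash(path):
        # Most expensive check: read and hash the whole file.
        h = hashlib.sha1()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1024 * 1024), b""):
                h.update(chunk)
        return h.hexdigest()

    def is_duplicate(target, candidate):
        # Eliminate candidates with the cheapest checks first; only the few
        # survivors pay for a full-content hash. Cache the target's hash if
        # you call this in a loop over many candidates.
        if not same_size(target, candidate):
            return False
        if not same_prefix(target, candidate):
            return False
        return full_hash(target) == full_hash(candidate)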
Upvotes: 4
Reputation: 4877
Well, since this question deals with optimizing the running time by constant factors rather than orders of magnitude, we'll have to be a bit more specific about the type of files you are actually dealing with.
You currently have two methods of comparing files: getting their size (fast but inaccurate) and getting their hash (slow but accurate enough). The problem is that computing the hashes of files can take a while when file sizes are not negligible.
So, depending on the types of inputs you actually have, you may be able to come up with more comparison operations that sit somewhere between these two (slower than a file size check, but more accurate). For example:
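One such intermediate check is hashing only a fixed-size prefix of each file; a minimal sketch (the 64 KB figure is an arbitrary assumption):

    import hashlib

    def partial_hash(path, n_bytes=64 * 1024):
        # Hash only the first n_bytes of the file: slower than a size check,
        # much faster than hashing the whole file, and usually enough to split
        # apart same-sized files that are not actually duplicates.
        h = hashlib.sha1()
        with open(path, "rb") as f:
            h.update(f.read(n_bytes))
        return h.hexdigest()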
In general, if you have many files with similar sizes (which is why you end up doing the hard work of computing hashes later), there is a high chance that these files have something in common. Given that you know the types of inputs better than we do right now, try coming up with comparison criteria that don't require you to read the entire file and so should be faster.
Finally, once you have all your comparison criteria, apply them to create "buckets" of inputs (lists of inputs with the same result from a criterion), starting with the fastest criterion and then applying the slower ones inside each bucket that has more than one input.
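A minimal sketch of that bucketing pipeline (the criteria in the usage comment are illustrative; partial_hash and full_hash stand in for whatever intermediate and exact checks you settle on):

    import os
    from collections import defaultdict

    def split_into_buckets(paths, criterion):
        # Group paths by the value the criterion returns; buckets with a single
        # member cannot contain duplicates and are dropped immediately.
        buckets = defaultdict(list)
        for p in paths:
            buckets[criterion(p)].append(p)
        return [group for group in buckets.values() if len(group) > 1]

    def find_duplicate_groups(paths, criteria):
        # Apply criteria from cheapest to most expensive; only buckets that
        # still hold more than one file reach the next (slower) criterion.
        groups = [paths]
        for criterion in criteria:
            groups = [bucket
                      for group in groups
                      for bucket in split_into_buckets(group, criterion)]
        return groups

    # Usage, cheapest criterion first:
    # duplicate_groups = find_duplicate_groups(all_paths,
    #                                          [os.path.getsize, partial_hash, full_hash])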
Upvotes: 2