Francesco Boffa
Francesco Boffa

Reputation: 1402

Dumping a physical disk on a file and hashing it at the same time

I've created a WPF application which reads a physical device (hard disk, usb mass storage) to a file. I get speeds up to 75-80 MB/s for HDs and 20-25 for USBs. I wanted to add MD5 and SHA1 hashing of the image on the fly. Basically I read a block of 128 sectors (64Kb) from the disk, I hash it using the two algos, and then I write the block to another file.

Well the two hashing functions seem to be a huge bottleneck. The speed went down to 5 Mb/s for Usb and 20 Mb/s for HD.

So I tought to put the hasing code in another thread. So one thread reads the blocks and puts the value in a FIFO list (made using List). Then another thread pops out a block and hases it. The problem is that the thread reading from the disk is faster than the hashing one and the List would grow exponentially. So I put a limit on the list of 1024 blocks. When the first thread sees that the list has 1024 blocks or more, it sleeps until it gets to 512...

This seems to work for the first few seconds. I get 19 Mb/s for the Usb. But just after a bit, it tends to get slower again. I suppose it filled the list and it's waiting for it to have some space...

Maybe my hasing functions are slow? I copied the first one that I found on the web... How can I enhance the speed of my application?

Thanks

Upvotes: 0

Views: 317

Answers (3)

Will Dean
Will Dean

Reputation: 39530

Clearly, given enough time, the throughput is going to be limited at the performance of whichever is the slower part of the process. The advantage of making the IO asynchronous to the hashing is just that you can keep both processes going together, not that you'll magically be faster than either of them individually.

It's hard to know why your hashing is so slow, but .NET contains both MD5 and SHA implementations, so you shouldn't need to be writing your own.

Presumably if you have two threads accessing one List (a Queue or ConcurrentQueue might have been better), you have some locking around it. Are you sure you're not holding a lock for a long time on one thread so the other thread gets blocked?

Ideally, you need to run a profiler of some kind, but you might be able to use Stopwatch and some trace to work out what's happening.

Upvotes: 1

Jeffrey Hantin
Jeffrey Hantin

Reputation: 36534

This is a relatively simple yet interesting performance puzzle, isn't it?

It certainly sounds to me like you are CPU-bottlenecked in the hash algorithm implementation. For a performant hash algorithm, rather than just copying something random, use the standard classes in System.Security.Cryptography such as SHA1CryptoServiceProvider and MD5CryptoServiceProvider.

If you have multiple cores available, consider splitting up the hashing work into separate threads. As a general rule of thumb, for n cores* use n+1 threads; if you have multithreaded cores (such as Intel HT), you may gain or lose performance by using them. The Task Parallel Library may help with this, especially since the input read loop can be easily rewritten as an iterator.


* For example, on a Pentium IV Prescott chip, approximately 10% performance is lost by using both cores in the standard Bitcoin client, which largely sits in a loop running hashes.

Upvotes: 0

titus
titus

Reputation: 5784

You could try out non-cryptographic hash functions here
They should be faster than the cryptographic ones

Upvotes: 0

Related Questions