Reputation: 12300
I'm using the following code to compute a checksum of a file, which works fine. But when I generate a hash for a large file, say 2 GB, it is quite slow. How can I improve the performance of this code?
fs = new FileStream(txtFile.Text, FileMode.Open);
formatted = string.Empty;
using (SHA1Managed sha1 = new SHA1Managed())
{
    byte[] hash = sha1.ComputeHash(fs);
    foreach (byte b in hash)
    {
        formatted += b.ToString("X2");
    }
}
fs.Close();
Update:
System:
OS: Win 7 64bit, CPU: I5 750, RAM: 4GB, HDD: 7200rpm
Tests:
Test1 = 59.895 seconds
Test2 = 59.94 seconds
Upvotes: 6
Views: 6480
Reputation: 113292
The first question is what you need this checksum for. If you don't need the cryptographic properties, then a non-cryptographic hash, or a hash that is less cryptographically secure (MD5 being "broken" doesn't prevent it from being a good hash, nor from being strong enough for some uses), is likely to be more performant. You could make your own hash by reading a subset of the data (I'd advise making this subset work in 4096-byte chunks of the underlying file, as that would match the buffer size used by SHA1Managed as well as allowing for a faster chunked read than you would get if you read, say, every Xth byte for some value of X).
Edit: An upvote reminding me of this answer has also reminded me that I have since written SpookilySharp, which provides high-performance 32-, 64- and 128-bit hashes that are not cryptographic, but good for providing checksums against errors, storage, etc. (This in turn has reminded me that I should update it to support .NET Core.)
Of course, if you want the SHA-1 of the file to interoperate with something else, you are stuck.
I would experiment with different buffer sizes, as increasing the size of the FileStream's buffer can increase speed at the cost of extra memory. I would advise a whole multiple of 4096 (4096 is the default, incidentally), as SHA1Managed will ask for 4096-byte chunks at a time; this way FileStream will never return less than the amount asked for (allowed, but sometimes suboptimal) or do more than one copy at a time.
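A minimal sketch of that, assuming a 64 KB buffer (16 × 4096) purely as a starting point to experiment with; the class and method names here are illustrative:

using System;
using System.IO;
using System.Security.Cryptography;

static class ChecksumSketch
{
    public static string Sha1Hex(string path)
    {
        // The fifth constructor argument is the FileStream's internal buffer size;
        // a whole multiple of 4096 keeps it aligned with SHA1Managed's 4096-byte reads.
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read,
                                       FileShare.Read, 16 * 4096))
        using (var sha1 = new SHA1Managed())
        {
            byte[] hash = sha1.ComputeHash(fs);
            return BitConverter.ToString(hash).Replace("-", "");
        }
    }
}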
Upvotes: 3
Reputation: 393
SHA1Managed is not the best choice for large inputs, nor is Byte.ToString("X2") the fastest way to convert the byte array to a string.
I just finished an article with detailed benchmarks on that topic. It compares SHA1Managed, SHA1CryptoServiceProvider, SHA1Cng and also considers SHA1.Create() on different length input strings.
In the second part, it shows 5 different methods of converting the byte array to a string, of which Byte.ToString("X2") is the worst (a faster alternative is sketched at the end of this answer).
My largest input was only 10,000 characters, so you may want to run my benchmarks on your 2 GB file. It would be quite interesting to see if and how that changes the numbers.
http://wintermute79.wordpress.com/2014/10/10/c-sha-1-benchmark/
However, for file integrity checks you are better off using MD5 as you already wrote.
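As an aside, a sketch of one faster alternative to calling b.ToString("X2") per byte is a precomputed lookup table; the names here are illustrative, and this is not necessarily one of the methods benchmarked in the article:

using System.Linq;
using System.Text;

static class HexSketch
{
    // Precompute the two-character hex string for every possible byte value.
    static readonly string[] HexTable =
        Enumerable.Range(0, 256).Select(i => i.ToString("X2")).ToArray();

    public static string ToHex(byte[] hash)
    {
        var sb = new StringBuilder(hash.Length * 2);
        foreach (byte b in hash)
            sb.Append(HexTable[b]);
        return sb.ToString();
    }
}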
Upvotes: 0
Reputation: 303
You can use this logic to get the SHA-1 value. I was using it in Java.
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class sha1Calculate {

    public static void main(String[] args) throws Exception
    {
        File file = new File("D:\\Android Links.txt");

        try {
            // Read the whole file into memory (note: this buffers the entire
            // file, which is expensive for very large inputs).
            FileInputStream input = new FileInputStream(file);
            ByteArrayOutputStream output = new ByteArrayOutputStream();
            byte[] buffer = new byte[65536];
            int l;
            while ((l = input.read(buffer)) > 0)
                output.write(buffer, 0, l);
            input.close();
            output.close();

            byte[] data = output.toByteArray();

            // Feed the buffered bytes to the SHA-1 digest.
            MessageDigest digest = MessageDigest.getInstance("SHA-1");
            digest.update(data, 0, data.length);
            byte[] bytes = digest.digest();

            // Format the digest as uppercase hex.
            StringBuilder sb = new StringBuilder();
            for (byte b : bytes)
            {
                sb.append(String.format("%02X", b));
            }
            System.out.println("Digest (in hex format): " + sb.toString());
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } catch (NoSuchAlgorithmException e) {
            e.printStackTrace();
        }
    }
}
Upvotes: -3
Reputation: 5008
First: SHA-1 file hashing should be I/O bound on non-ancient CPUs - and an i5 certainly doesn't qualify as ancient. Of course it depends on the implementation of SHA-1, but I doubt SHA1Managed is über-slow.
Next, 60 seconds for 2 GB of data is ~34 MB/s - that's slow for hard disk reads; even a 2.5" laptop disk can read faster than that. Assuming the hard drive is internal (no USB2/whatever or network bottleneck), and there's not a lot of other disk I/O activity, I'd be surprised to see less than 60 MB/s read from a modern drive.
My guess would be that ComputeHash() uses a small buffer internally. Try manually reading/hashing, so you can specify a larger buffer (64 KB or even larger) to increase throughput. You could also move to async processing so disk reads and hash computation can be overlapped.
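A sketch of that manual read/hash loop, assuming a 64 KB buffer (the size is something to experiment with, and the class/method names are illustrative):

using System;
using System.IO;
using System.Security.Cryptography;

static class ChunkedHashSketch
{
    public static string Sha1Hex(string path)
    {
        using (var sha1 = SHA1.Create())
        using (var fs = File.OpenRead(path))
        {
            var buffer = new byte[64 * 1024];
            int read;
            // Feed the file to the hash in 64 KB chunks instead of letting
            // ComputeHash pull it in with its own (smaller) internal buffer.
            while ((read = fs.Read(buffer, 0, buffer.Length)) > 0)
                sha1.TransformBlock(buffer, 0, read, null, 0);
            sha1.TransformFinalBlock(buffer, 0, 0);
            return BitConverter.ToString(sha1.Hash).Replace("-", "");
        }
    }
}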
Upvotes: 1
Reputation: 49719
First of all, have you measured "quite slow"? From this site, SHA-1 has about half the speed of MD5 with about 100 MB/s (depending on the CPU), so 2 GB would take about 20 seconds to hash. Also, note that if you're using a slow HDD, this might be your real bottleneck as 30-70 MB/s aren't unusual.
To speed things up, you might not hash the whole file, but only the first X KB or representative parts of it (the parts that will most likely differ). If your files aren't too similar, this shouldn't cause duplicates.
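A sketch of the "first X KB" idea, with X = 1 MB chosen arbitrarily; the class and method names are illustrative, and this trades full-file integrity for speed:

using System;
using System.IO;
using System.Security.Cryptography;

static class PrefixHashSketch
{
    public static string Sha1HexOfPrefix(string path, int prefixBytes = 1024 * 1024)
    {
        using (var sha1 = SHA1.Create())
        using (var fs = File.OpenRead(path))
        {
            // Read up to prefixBytes from the start of the file (Read may return
            // fewer bytes per call, so loop until the prefix is full or EOF).
            var buffer = new byte[prefixBytes];
            int total = 0, read;
            while (total < buffer.Length &&
                   (read = fs.Read(buffer, total, buffer.Length - total)) > 0)
                total += read;
            byte[] hash = sha1.ComputeHash(buffer, 0, total);
            return BitConverter.ToString(hash).Replace("-", "");
        }
    }
}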
Upvotes: 1
Reputation: 1500873
Well, is it IO-bound or CPU-bound? If it's CPU-bound, there's not a lot we can do about that.
It's possible that opening the FileStream with different parameters would allow the file system to do more buffering or assume that you're going to read the file sequentially - but I doubt that will help very much. (It's certainly not going to do a lot if it's CPU-bound.)
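For example, a sketch of opening the stream with a sequential-access hint and a larger buffer; whether this actually helps is exactly what you'd need to measure, and the names are illustrative:

using System.IO;

static class SequentialOpenSketch
{
    public static FileStream OpenForHashing(string path)
    {
        // FileOptions.SequentialScan hints to the OS that the file will be
        // read start-to-end, which can improve read-ahead behaviour.
        return new FileStream(path, FileMode.Open, FileAccess.Read,
                              FileShare.Read, 64 * 1024,
                              FileOptions.SequentialScan);
    }
}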
How slow is "quite slow" anyway? Compared with, say, copying the file?
If you have a lot of memory (e.g. 4GB or more) how long does it take to hash the file a second time, when it may be in the file system cache?
Upvotes: 1