Reputation: 583
I know git is fast but I have only recently found how fast it really can be.
In one of my projects, I am trying to compute SHA-256 hashes of a huge file (82 MB with 850k rows) and it took over a minute to compute it (includes hashing and a couple of other small operations).
Even with SHA-1, it took me 30+ seconds while git seems to be doing it in just a second or two.
I am computing the hash in Scala using java's Security
API by combining all the lines of the file.
val lines = Source.fromFile(filePath, "UTF-8").getLines().toList
MessageDigest.getInstance("SHA-256")
.digest(lines.mkString("\n").getBytes).map("%02x".format(_)).mkString
So, how does Git do it so quickly, or rather the more important question, why is my method so slow?
Edit: For those not familiar with scala syntax, lines
will have all the lines of the file in a List
and mkString
method returns a string of all the elements in the list combined with the given separator.
Upvotes: 0
Views: 575
Reputation: 11845
Reposting my earlier comment (extended).
What you do is:
Steps 2-6 seem unnecessary. I would recommend to read bytes from your initial FileInputStream
in chunks (for example, of 4k) and feed them to the MessageDigest
for update. That would just perform steps 1 and 7.
InputStream is = new FileInputStream(fileName);
byte[] buffer = new byte[4096];
while (true) {
int read = is.read(buffer);
if (read < 0) {
break;
}
md.update(buffer, 0, read);
}
is.close(); // better be done in finally
As for sha1 performance, here is what I got for time sha1sum <file>
where file is 179Mb:
real 0m0.607s
user 0m0.588s
sys 0m0.016s
Upvotes: 7
Reputation: 4536
Git undoubtedly is faster, but 30 seconds for SHA-1 is not so great.
So I ran a test in java:
public static void main(String[] args) throws Exception{
long startTime = System.currentTimeMillis();
byte[] bytes = createSha1(new File("src\\main\\resources\\200mb_file.zip"));
System.out.println(new String(bytes));
long endTime = System.currentTimeMillis();
long duration = (endTime - startTime);
System.out.format("Duration: %dms\n", duration);
}
private static byte[] createSha1(File file) throws Exception {
MessageDigest digest = MessageDigest.getInstance("SHA-1");
InputStream fis = new FileInputStream(file);
int n = 0;
byte[] buffer = new byte[8192];
while (n != -1) {
n = fis.read(buffer);
if (n > 0) {
digest.update(buffer, 0, n);
}
}
return digest.digest();
}
Output:
Duration: 1531
My guess what is causing your slowness is the fact you are inputting it to a list rather the directly using it as a stream.
Upvotes: 1
Reputation: 488453
The hash computation is redirected at compile time to a specific implementation in cache.h. The underlying platform may provide an optimized (e.g., assembler or machine-dependent C coded) hash routine. Of course, your Java implementation may or may not also provide such a routine.
If the platform does not have its own implementation, Git provides one written in C that works on large memory blocks, and still has some hand-tweaks and inline asm
s with architecture and compiler ifdef
s.
Upvotes: 1