Durga Swaroop
Durga Swaroop

Reputation: 583

How does Git compute SHA hashes so quickly?

I know git is fast but I have only recently found how fast it really can be.

In one of my projects, I am trying to compute SHA-256 hashes of a huge file (82 MB with 850k rows) and it took over a minute to compute it (includes hashing and a couple of other small operations).

Even with SHA-1, it took me 30+ seconds while git seems to be doing it in just a second or two.

I am computing the hash in Scala using java's Security API by combining all the lines of the file.

val lines = Source.fromFile(filePath, "UTF-8").getLines().toList
MessageDigest.getInstance("SHA-256")
.digest(lines.mkString("\n").getBytes).map("%02x".format(_)).mkString

So, how does Git do it so quickly, or rather the more important question, why is my method so slow?

Edit: For those not familiar with scala syntax, lines will have all the lines of the file in a List and mkString method returns a string of all the elements in the list combined with the given separator.

Upvotes: 0

Views: 575

Answers (3)

Roman Puchkovskiy
Roman Puchkovskiy

Reputation: 11845

Reposting my earlier comment (extended).

What you do is:

  1. Read bytes
  2. Convert them to characters
  3. Split the character stream to lines
  4. Store those lines into a list
  5. Again concatenate those lines into a string
  6. Take its bytes again
  7. Compute hash of those bytes

Steps 2-6 seem unnecessary. I would recommend to read bytes from your initial FileInputStream in chunks (for example, of 4k) and feed them to the MessageDigest for update. That would just perform steps 1 and 7.

InputStream is = new FileInputStream(fileName);
byte[] buffer = new byte[4096];
while (true) {
    int read = is.read(buffer);
    if (read < 0) {
        break;
    }
    md.update(buffer, 0, read);
}
is.close(); // better be done in finally

As for sha1 performance, here is what I got for time sha1sum <file> where file is 179Mb:

real    0m0.607s
user    0m0.588s
sys 0m0.016s

Upvotes: 7

Aaron Brock
Aaron Brock

Reputation: 4536

Git undoubtedly is faster, but 30 seconds for SHA-1 is not so great.

So I ran a test in java:

public static void main(String[] args) throws Exception{
    long startTime = System.currentTimeMillis();

    byte[] bytes = createSha1(new File("src\\main\\resources\\200mb_file.zip"));
    System.out.println(new String(bytes));

    long endTime = System.currentTimeMillis();
    long duration = (endTime - startTime);
    System.out.format("Duration: %dms\n", duration);
}

private static byte[] createSha1(File file) throws Exception  {
    MessageDigest digest = MessageDigest.getInstance("SHA-1");
    InputStream fis = new FileInputStream(file);
    int n = 0;
    byte[] buffer = new byte[8192];
    while (n != -1) {
        n = fis.read(buffer);
        if (n > 0) {
            digest.update(buffer, 0, n);
        }
    }
    return digest.digest();
}

Output:

Duration: 1531

My guess what is causing your slowness is the fact you are inputting it to a list rather the directly using it as a stream.

Upvotes: 1

torek
torek

Reputation: 488453

The hash computation is redirected at compile time to a specific implementation in cache.h. The underlying platform may provide an optimized (e.g., assembler or machine-dependent C coded) hash routine. Of course, your Java implementation may or may not also provide such a routine.

If the platform does not have its own implementation, Git provides one written in C that works on large memory blocks, and still has some hand-tweaks and inline asms with architecture and compiler ifdefs.

Upvotes: 1

Related Questions