Greg
Greg

Reputation: 1853

Can you SHA1 hash a file to a certain length

I need to compare two files. One file may be longer than the other and I need to check to see if the longer file contains all of the data of the shorter file. I could do a binary compare of the two something like this:

function compareFiles($file_a, $file_b){
  if (filesize($file_a) > filesize($file_b)){
    $fp_a = fopen($file_b, 'rb');
    $fp_b = fopen($file_a, 'rb');
  } else { // filesize($file_b) > filesize($file_a)
    $fp_a = fopen($file_a, 'rb');
    $fp_b = fopen($file_b, 'rb');
  }
  while (($b = fread($fp_a, 4096)) !== false){
    $b_b = fread($fp_b, 4096);
    if ($b !== $b_b){
      fclose($fp_a);
      fclose($fp_b);
      return false;
    }
  }
  fclose($fp_a);
  fclose($fp_b);
  return true;
}

but this would be slow. As an alternative, I could compare the SHA1 hash of the smaller file against the SHA1 hash of the larger file up and until the size of the smaller file, something like this:

function compareFiles($file_a, $file_b){
  $tmpfile = '/dev/shm/tmp_file_copy.bin';
  if (filesize($file_a) > filesize($file_b)){
    $readfromfile = $file_b;
    $bytes_to_copy = filesize($file_b);
  } else {
    $readfromfile = $file_a
    $bytes_to_copy = filesize($file_a);
  }
  $readfile = fopen($readfromfile, 'rb');
  $writefile = fopen($tmpfile, 'wb');
  while (!feof($readfile) && $bytes_to_copy> 0) {
    if ($bytes_to_copy <= 8192) {
      $contents = fread($readfile, $bytes_to_copy);
      $bytes_to_copy = 0;
    } else {
      $contents = fread($readfile, 8192);
      $bytes_to_copy =- 8192;
    }
    fwrite($writefile, $contents);
  }
  fclose($writefile);
  fclose($readfile);
  $result = sha1_file($readfromfile = $file_a ? $file_b : $file_a) === sha1_file($tmpfile);
  unlink($tmpfile);
  return $result;
}

but I fear that this would also be slow as it involves a lot of I/O (to /dev/shm).

In short, I'm looking for a better way...

Upvotes: 1

Views: 433

Answers (2)

Patrick M
Patrick M

Reputation: 10989

Hashing the files in this case will only be slower. Consider the following case.

File A.txt contents:

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

File B.txt contents:

AAAAAAAAAAAAAAAAAAAAABBBBBBBBBB

Note that A.txt is 40 characters total, 10 characters longer than B.txt at 30 characters

How much I/O do we have to do on each file to determine if A.txt contains all of B.txt? 40 bytes? 30 bytes? No, the answer is only 20 bytes, because that's how much the two files share have in common. You stream each file one byte (or chunk of bytes) at a time and compare them as you go. The results of such a comparison looks like this:

A.txt: AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

B.txt: AAAAAAAAAAAAAAAAAAAAABBBBBBBBBB

Stream ---------------------^
Result ✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓X

And then you stop. Why compare the rest?

If you hash both files, you have to have the whole contents in memory to compute the hash. Even if you hash them in chunks while streaming them into memory, which do you think is faster: comparing each byte from each file or hashing the chunk? The complexity of the comparison is O(number of bytes) whereas the complexity of the SHA-1 hash algorithm is specified by RFC 3174.

Upvotes: 2

Marcin Szwarc
Marcin Szwarc

Reputation: 581

The byte-by-byte is the best method in your case. It only compares first x bytes of both files and stops if they are different. Hash function must process all bytes in file. Isn't that slower?

Upvotes: 1

Related Questions