Dienrop
Dienrop

Reputation: 53

Why is the md5 hash of the tar-part of a tar.gz via TeeReader wrong?

I was just experimenting with archive/tar and compress/gzip, for automated processing of some backups I have.

My problem hereby is: I have various .tar files and .tar.gz files floating around, and thus I want to extract the hash (md5) of the .tar.gz file, and the hash (md5) of the .tar file as well, ideally in one run.

The example code I have so far, works perfectly fine for the hashes of the files in the .tar.gz as well for the .gz, but the hash for the .tar is wrong and I can't find out what the problem is.

I looked at the tar/reader.go file and I saw that there is some skipping in there, yet I thought everything should run over the io.Reader interface and thus the TeeReader should still catch all the bytes.

package main

import (
    "archive/tar"
    "compress/gzip"
    "crypto/md5"
    "fmt"
    "io"
    "os"
)

func main() {
    tgz, _ := os.Open("tb.tar.gz")
    gzMd5 := md5.New()
    gz, _ := gzip.NewReader(io.TeeReader(tgz, gzMd5))
    tarMd5 := md5.New()
    tr := tar.NewReader(io.TeeReader(gz, tarMd5))
    for {
        fileMd5 := md5.New()
        hdr, err := tr.Next()
        if err == io.EOF {
            break
        }
        io.Copy(fileMd5, tr)
        fmt.Printf("%x  %s\n", fileMd5.Sum(nil), hdr.Name)
    }
    fmt.Printf("%x  tb.tar\n", tarMd5.Sum(nil))
    fmt.Printf("%x  tb.tar.gz\n", gzMd5.Sum(nil))
}

Now for the following example:

$ echo "a" > a.txt
$ echo "b" > b.txt
$ tar cf tb.tar a.txt b.txt 
$ gzip -c tb.tar > tb.tar.gz
$ md5sum a.txt b.txt tb.tar tb.tar.gz

60b725f10c9c85c70d97880dfe8191b3  a.txt
3b5d5c3712955042212316173ccf37be  b.txt
501352dcd8fbd0b8e3e887f7dafd9392  tb.tar
90d6ba204493d8e54d3b3b155bb7f370  tb.tar.gz

On Linux Mint 14 (based on Ubuntu 12.04) with go 1.02 from the Ubuntu repositories the result for my go program is:

$ go run tarmd5.go 
60b725f10c9c85c70d97880dfe8191b3  a.txt
3b5d5c3712955042212316173ccf37be  b.txt
a26ddab1c324780ccb5199ef4dc38691  tb.tar
90d6ba204493d8e54d3b3b155bb7f370  tb.tar.gz

So all hashes except for tb.tar are as expected. (Of course if you retry that example your .tar and .tar.gz will be different from this, because of different timestamps)

Any hint about how to get it work would be greatly appreciated, I really would prefer to have it in 1 run though (with the TeeReaders).

Upvotes: 5

Views: 2415

Answers (1)

Stephen Weinberg
Stephen Weinberg

Reputation: 53418

The issue occurs because tar doesn't read every byte from your reader. After hashing each file, you need to empty the reader to ensure every byte is read and hashed. The way I normally do this is use io.Copy() to read until EOF.

package main

import (
    "archive/tar"
    "compress/gzip"
    "crypto/md5"
    "fmt"
    "io"
    "io/ioutil"
    "os"
)

func main() {
    tgz, _ := os.Open("tb.tar.gz")
    gzMd5 := md5.New()
    gz, _ := gzip.NewReader(io.TeeReader(tgz, gzMd5))
    tarMd5 := md5.New()
    tee := io.TeeReader(gz, tarMd5) // need the reader later
    tr := tar.NewReader(tee)
    for {
        fileMd5 := md5.New()
        hdr, err := tr.Next()
        if err == io.EOF {
            break
        }
        io.Copy(fileMd5, tr)
        fmt.Printf("%x  %s\n", fileMd5.Sum(nil), hdr.Name)
    }
    io.Copy(ioutil.Discard, tee) // read unused portions of the tar file
    fmt.Printf("%x  tb.tar\n", tarMd5.Sum(nil))
    fmt.Printf("%x  tb.tar.gz\n", gzMd5.Sum(nil))
}

Another option is to just add io.Copy(tarMd5, gz) before your tarMd5.Sum() call. I think the first way is clearer even if I needed to add/modify four lines instead of one.

Upvotes: 5

Related Questions