gyula

Reputation: 267

Is there a faster alternative to ioutil.ReadFile?

I am writing a program that finds duplicate files by comparing MD5 checksums. I am not sure whether I am missing something, but when this function reads the Xcode installer app (about 8GB), it uses 16GB of RAM.

func search() {
    unique := make(map[string]string)
    files, err := ioutil.ReadDir(".")
    if err != nil {
        log.Println(err)
    }

    for _, file := range files {
        fileName := file.Name()
        fmt.Println("CHECKING:", fileName)
        fi, err := os.Stat(fileName)
        if err != nil {
            fmt.Println(err)
            continue
        }
        if fi.Mode().IsRegular() {
            data, err := ioutil.ReadFile(fileName)
            if err != nil {
                fmt.Println(err)
                continue
            }
            sum := md5.Sum(data)
            hexDigest := hex.EncodeToString(sum[:])
            if _, ok := unique[hexDigest]; !ok {
                unique[hexDigest] = fileName
            } else {
                fmt.Println("DUPLICATE:", fileName)
            }
        }
    }
}

According to my debugging, the problem is the file reading. Is there a better approach? Thanks.

Upvotes: 3

Views: 6231

Answers (2)

Kiril

Reputation: 6219

There is an example in the Go documentation for crypto/md5 that covers your case.

package main

import (
    "crypto/md5"
    "fmt"
    "io"
    "log"
    "os"
)

func main() {
    f, err := os.Open("file.txt")
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    h := md5.New()
    if _, err := io.Copy(h, f); err != nil {
        log.Fatal(err)
    }

    fmt.Printf("%x", h.Sum(nil))
}

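For your case, make sure to close each file inside the loop rather than deferring the Close: a deferred call runs only when the surrounding function returns, so every file would stay open until the whole loop has finished. Alternatively, put the logic into its own function.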

Upvotes: 6

twotwotwo

Reputation: 30007

Sounds like the 16GB RAM is your problem, not speed per se.

Don't read the entire file into a variable with ReadFile. Instead, io.Copy from the Reader that os.Open gives you to the Writer that crypto/md5 provides (md5.New returns a hash.Hash, which embeds an io.Writer). That copies only a small chunk at a time instead of pulling the whole file into RAM.

This trick is useful in a lot of places in Go; packages like text/template, compress/gzip, net/http, etc. work in terms of Readers and Writers. With them, you don't usually need to create huge []bytes or strings; you can hook I/O interfaces up to each other and let them pass around pieces of content for you. In a garbage-collected language, saving memory tends to save you CPU work as well.

Upvotes: 5
