pjox

Reputation: 153

How to improve file encoding conversion in Go

I've been working with some huge files that I have to convert to UTF-8. Because the files are enormous, traditional tools like iconv won't work, so I decided to write my own tool in Go. However, I noticed that this encoding conversion is quite slow in Go. Here is my code:

package main

import (
    "fmt"
    "io"
    "log"
    "os"

    "golang.org/x/text/encoding/charmap"
)

func main() {
    if len(os.Args) != 3 {
        fmt.Fprintf(os.Stderr, "usage:\n\t%s [input] [output]\n", os.Args[0])
        os.Exit(1)
    }

    f, err := os.Open(os.Args[1])

    if err != nil {
        log.Fatal(err)
    }

    out, err := os.Create(os.Args[2])

    if err != nil {
        log.Fatal(err)
    }

    r := charmap.ISO8859_1.NewDecoder().Reader(f)

    buf := make([]byte, 1048576)

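    // Check the copy error so a failed read or write is not silently ignored.
    if _, err := io.CopyBuffer(out, r, buf); err != nil {
        log.Fatal(err)
    }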

    out.Close()
    f.Close()
}

Similar code in Python is much more performant:

import codecs
BLOCKSIZE = 1048576 # or some other, desired size in bytes
with codecs.open("FRWAC-01.xml", "r", "latin_1") as sourceFile:
    with codecs.open("FRWAC-01-utf8.xml", "w", "utf-8") as targetFile:
        while True:
            contents = sourceFile.read(BLOCKSIZE)
            if not contents:
                break
            targetFile.write(contents)

I was sure my Go code would be much quicker because I/O in Go is generally fast, but it turns out to be much slower than the Python code. Is there a way to improve the Go program?

Upvotes: 0

Views: 1079

Answers (1)

Mr_Pink

Reputation: 109347

The problem here is that you're not comparing the same code in both cases. Also, I/O speed in Go can't be significantly different than in Python, since both make the same syscalls.

In the Python version, the files are buffered by default. In the Go version, although you're using io.CopyBuffer with a 1048576-byte buffer, the decoder makes Read calls of whatever size it needs directly on the unbuffered file.

Wrapping the file I/O with bufio will produce comparable results.

inFile, err := os.Open(os.Args[1])
if err != nil {
    log.Fatal(err)
}
defer inFile.Close()

outFile, err := os.Create(os.Args[2])
if err != nil {
    log.Fatal(err)
}
defer outFile.Close()

in := bufio.NewReaderSize(inFile, 1<<20)

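out := bufio.NewWriterSize(outFile, 1<<20)

r := charmap.ISO8859_1.NewDecoder().Reader(in)

if _, err := io.Copy(out, r); err != nil {
    log.Fatal(err)
}

// Flush explicitly so an error on the final write is reported
// rather than lost in a deferred call.
if err := out.Flush(); err != nil {
    log.Fatal(err)
}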
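
To see why the buffering matters, here is a rough, standard-library-only sketch that counts how many Read calls actually reach the underlying source with and without a bufio.Reader in between. The names countingReader, drainInChunks and countReads are hypothetical helpers made up for this illustration, and the 4096-byte chunk is only an assumed stand-in for a consumer that issues small reads; it is not a measurement of what the charmap decoder really does. With a real file, each of those Read calls on the source would be a syscall.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"io"
)

// countingReader is a hypothetical helper that counts how many Read
// calls reach the underlying source.
type countingReader struct {
	r     io.Reader
	calls int
}

func (c *countingReader) Read(p []byte) (int, error) {
	c.calls++
	return c.r.Read(p)
}

// drainInChunks reads src to EOF using a fixed-size chunk. It stands in
// for a consumer (such as a decoder) that issues small reads of its own.
func drainInChunks(src io.Reader, chunk int) error {
	buf := make([]byte, chunk)
	for {
		_, err := src.Read(buf)
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
	}
}

// countReads drains an in-memory source of the given size in chunk-sized
// reads and returns how many Read calls hit the source. If bufSize > 0,
// a bufio.Reader of that size sits between the consumer and the source.
func countReads(size, chunk, bufSize int) int {
	src := &countingReader{r: bytes.NewReader(make([]byte, size))}
	var r io.Reader = src
	if bufSize > 0 {
		r = bufio.NewReaderSize(src, bufSize)
	}
	if err := drainInChunks(r, chunk); err != nil {
		panic(err)
	}
	return src.calls
}

func main() {
	const size = 8 << 20 // 8 MiB of dummy input
	fmt.Println("reads without bufio:", countReads(size, 4096, 0))
	fmt.Println("reads with bufio:   ", countReads(size, 4096, 1<<20))
}
```

The buffered run should report far fewer reads on the source, because bufio fills its 1 MiB buffer in one call and then serves the small reads from memory.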

Upvotes: 4
