M.E.

Reputation: 5515

Performance issues while reading a file line by line with bufio.NewScanner

I am learning how to efficiently read very large files in Go. I have tried bufio.NewScanner and bufio.NewReader with ReadString('\n'). Of the two, NewScanner is consistently faster (about 2:1).

Even so, I found that NewScanner takes much longer to read a file line by line than a plain unix cat command does.

I have measured how long it takes to run this code:

package main

import (
    "bufio"
    "fmt"
    "os"
)

func main() {
    file, err := os.Open("test")
    if err != nil {
        panic(err)
    }
    defer file.Close()

    scanner := bufio.NewScanner(file)
    for scanner.Scan() {
        fmt.Println(scanner.Text())
    }
    if err := scanner.Err(); err != nil {
        panic(err)
    }
}

Comparing against a plain unix cat, I get the following results:

$ time ./parser3 > /dev/null
       19.13 real        13.81 user         5.94 sys
$ time cat test > /dev/null
        0.83 real         0.08 user         0.74 sys

The time difference is consistent among several executions.

I understand that scanning for '\n' adds overhead compared to simply copying data from input to output as cat does.

But given the gap between cat and this snippet, I am asking myself whether this is the most efficient way to read a file line by line in Go.

Upvotes: 0

Views: 3695

Answers (2)

M.E.

Reputation: 5515

As per MuffinTop's comment, this is the code that fixes the speed problem. The performance penalty is not related to the use of Scanner itself, but to:

  1. Not buffering the output
  2. Using scanner.Text(), which allocates a string on every line, instead of scanner.Bytes()

Performance adding output buffering:

package main

import (
    "bufio"
    "fmt"
    "os"
)

func main() {
    file, err := os.Open("test")
    if err != nil {
        panic(err)
    }
    defer file.Close()

    w := bufio.NewWriter(os.Stdout)
    defer w.Flush() // without this, the buffered tail of the output is lost

    scanner := bufio.NewScanner(file)
    for scanner.Scan() {
        fmt.Fprintln(w, scanner.Text())
    }
}

The above solution uses output buffering and takes 6.6 seconds, versus the original 19.1 seconds.

Performance adding output buffering and writing .Bytes() instead of .Text():

package main

import (
    "bufio"
    "os"
)

func main() {
    file, err := os.Open("test")
    if err != nil {
        panic(err)
    }
    defer file.Close()

    w := bufio.NewWriter(os.Stdout)
    defer w.Flush()

    scanner := bufio.NewScanner(file)
    for scanner.Scan() {
        w.Write(scanner.Bytes()) // no per-line string allocation
        w.WriteByte('\n')
    }
}

The above solution uses output buffering and writes the scanner's Bytes directly, taking 2.2 seconds versus the original 19.1 seconds.

Upvotes: 5

Josh

Reputation: 1267

First consider what type of data you will be reading (CSV, JSON, ...), which system will be reading it, and why; all of these details must be taken into consideration. There is no one-size-fits-all, and the best developers know how to use the right tool for the job. Without knowing the RAM limits, the data, etc., it is unclear. Will you be reading large amounts of JSON from a server, or parsing a small text file? Each function has a purpose. Do not get into the habit of thinking one way is better or worse than another; that limits your learning and skill set.

Ways to read a file:

  • Document parsing: reads all of the data from the file and turns it into an object.
  • Stream parsing: reads a single element at a time, then moves on to the next one.

Some methods may be faster, but they read the whole file into memory; what if your file is too big?

Or what if your files are not that large and you are looking for duplicate names? If you are searching for multiple items, is reading line by line still the best way?

You should check out this post; special thanks to @Schwern, who already answered this question here. Using Bytes() with the scanner is slightly faster. See: Link

 
I found that using scanner.Bytes() instead of scanner.Text() improves speed slightly on my machine. bufio's scanner.Bytes() method doesn't allocate any additional memory, whereas Text() creates a string from its buffer.

Upvotes: -1
