M.E.

Reputation: 5515

Performance issues while reading a file line by line with bufio.NewScanner

I am learning how to efficiently read very large files in Go. I have tried bufio.NewScanner and bufio.NewReader with ReadString('\n'). Of the two, NewScanner is consistently faster (about 2:1).

Even so, I found that NewScanner takes much longer to read a file line by line than a plain unix cat command does.

I have measured how long it takes to run this code:

package main

import (
    "bufio"
    "fmt"
    "os"
)

func main() {
    file, err := os.Open("test")
    if err != nil {
        panic(err)
    }
    defer file.Close()

    scanner := bufio.NewScanner(file)
    for scanner.Scan() {
        fmt.Println(scanner.Text())
    }
    if err := scanner.Err(); err != nil {
        panic(err)
    }
}

Comparing against a plain unix cat, I get the following results:

$ time ./parser3 > /dev/null
       19.13 real        13.81 user         5.94 sys
$ time cat test > /dev/null
        0.83 real         0.08 user         0.74 sys

The time difference is consistent among several executions.

I understand that scanning for '\n' adds overhead compared to simply copying data from input to output as cat does.

But given the gap between cat and this snippet, I am asking myself whether this is the most efficient way to read a file line by line in Go.

Upvotes: 0

Views: 3695

Answers (2)

M.E.

Reputation: 5515

As per MuffinTop's comment, this is the code that fixes the speed problem. The performance penalty is not related to the use of Scanner itself, but to:

  1. Not buffering the output
  2. Using scanner.Text(), which allocates a string on every line, instead of scanner.Bytes()

Performance adding output buffering:

package main

import (
    "bufio"
    "fmt"
    "os"
)

func main() {
    file, err := os.Open("test")
    if err != nil {
        panic(err)
    }
    defer file.Close()

    w := bufio.NewWriter(os.Stdout)
    defer w.Flush() // without this, the buffered tail of the output is lost

    scanner := bufio.NewScanner(file)
    for scanner.Scan() {
        fmt.Fprintln(w, scanner.Text())
    }
}

The above solution uses output buffering and takes 6.6 seconds, versus the original 19.1 seconds.

Performance adding output buffering and writing .Bytes() instead of .Text():

package main

import (
    "bufio"
    "os"
)

func main() {
    file, err := os.Open("test")
    if err != nil {
        panic(err)
    }
    defer file.Close()

    w := bufio.NewWriter(os.Stdout)
    defer w.Flush()

    scanner := bufio.NewScanner(file)
    for scanner.Scan() {
        w.Write(scanner.Bytes()) // no per-line string allocation
        w.WriteByte('\n')
    }
}

The above solution uses output buffering and writes the scanner's Bytes directly, taking 2.2 seconds versus the original 19.1 seconds.

Upvotes: 5

Josh

Reputation: 1267

First consider what type of data you will be reading (CSV, JSON, ...), which system will be reading it, and why; all of these details must be taken into consideration. There is no one-size-fits-all, and the best developers know how to use the right tool for the job. Without knowing the RAM limits, the data, etc., it is unclear. Will you be reading large amounts of JSON from a server, or parsing a small text file? Each function has a purpose. Do not get into the habit of thinking one way is better or worse than another; that limits your learning and skill set.

Ways to read a file:

  • Document parsing: reads all of the data from the file and turns it into an object.
  • Stream parsing: reads a single element at a time, then moves on to the next one.

Some methods may be faster, but they read the whole file into memory; what if your file is too big?

Or what if your files are not that large and you are looking for duplicate names? If you are searching for multiple items, is reading line by line still the best way?

You should check out this post; special thanks to @Schwern, who already answered this question here. Using Bytes() with the scanner is slightly faster. See: Link

 
I found that using scanner.Bytes() instead of scanner.Text() improves speed slightly on my machine. bufio's scanner.Bytes() method doesn't allocate any additional memory, whereas Text() creates a string from its buffer.

Upvotes: -1
