Reputation: 5515
I am learning how to read very large files efficiently in Go. I have tried bufio.NewScanner and bufio.NewReader with ReadString('\n'). Of the two, NewScanner is consistently faster (about 2:1).
Even so, NewScanner takes much longer to read a file line by line than a plain Unix cat command does.
I have measured how long it takes to run this code:
package main

import (
	"bufio"
	"fmt"
	"os"
)

func main() {
	file, err := os.Open("test")
	if err != nil {
		panic(err)
	}
	defer file.Close()

	scanner := bufio.NewScanner(file)
	for scanner.Scan() {
		fmt.Println(scanner.Text())
	}
}
Comparing against a plain Unix cat, I get the following results:
$ time ./parser3 > /dev/null
19.13 real 13.81 user 5.94 sys
$ time cat test > /dev/null
0.83 real 0.08 user 0.74 sys
The time difference is consistent among several executions.
I understand that scanning for '\n' adds overhead compared to simply copying data from input to output as cat does. But given the difference between cat and this snippet, I am asking myself whether this is the most efficient way to read a file line by line in Go.
Upvotes: 0
Views: 3695
Reputation: 5515
As per MuffinTop's comment, these are the snippets of code that improve the speed. The performance penalty is not related to the use of Scanner itself, but to writing each line unbuffered to stdout and to scanner.Text() -which allocates a new string per line- instead of scanner.Bytes().
Performance adding output buffering:
package main

import (
	"bufio"
	"fmt"
	"os"
)

func main() {
	file, err := os.Open("test")
	if err != nil {
		panic(err)
	}
	defer file.Close()

	w := bufio.NewWriter(os.Stdout)
	defer w.Flush() // without this, the last buffered output is lost

	scanner := bufio.NewScanner(file)
	for scanner.Scan() {
		fmt.Fprintln(w, scanner.Text())
	}
}
The above solution uses output buffering and takes 6.6 seconds, versus the original 19.1 seconds.
Performance adding output buffering and using .Bytes() instead of .Text():
package main

import (
	"bufio"
	"os"
)

func main() {
	file, err := os.Open("test")
	if err != nil {
		panic(err)
	}
	defer file.Close()

	w := bufio.NewWriter(os.Stdout)
	defer w.Flush() // without this, the last buffered output is lost

	scanner := bufio.NewScanner(file)
	for scanner.Scan() {
		w.Write(scanner.Bytes())
		w.WriteByte('\n')
	}
}
The above solution uses output buffering, writes the raw bytes from the scanner, and takes 2.2 seconds versus the original 19.1 seconds.
Upvotes: 5
Reputation: 1267
First consider what type of data you will be reading (CSV, JSON, ...), the system that will be reading it, and why; all of these details must be taken into consideration. There is no one-size-fits-all, and the best developers know how to use the right tool for the job. Without knowing RAM limits, the data, etc., it is unclear. Will you be reading large amounts of JSON from a server, or parsing a small text file? Each function has a purpose. Do not get into the habit of thinking one way is better or worse than another; that limits your learning and skill set.
Some methods may be faster but read the whole file into memory; what if your file is too big? Or what if your files are not that large and you are looking for, say, duplicate names? If you are searching for multiple items, is scanning the file line by line really the best way?
You should check out this post. Special thanks to @Schwern, who already answered your question Here. Using bytes with the scanner is slightly faster. See: Link
I found that using scanner.Bytes() instead of scanner.Text() improves speed slightly on my machine. bufio's scanner.Bytes() method doesn't allocate any additional memory, whereas Text() creates a string from its buffer.
Upvotes: -1