Reputation: 3556
I'm trying to figure out the fastest way of reading a large file line by line and checking if the line contains a string. The file I'm testing on is about 680mb in size:
package main
import (
"bufio"
"fmt"
"os"
"strings"
)
func main() {
f, err := os.Open("./crackstation-human-only.txt")
scanner := bufio.NewScanner(f)
if err != nil {
panic(err)
}
defer f.Close()
for scanner.Scan() {
if strings.Contains(scanner.Text(), "Iforgotmypassword") {
fmt.Println(scanner.Text())
}
}
}
After building the program and timing it on my machine it runs over 3 seconds
./speed 3.13s user 1.25s system 122% cpu 3.563 total
After increasing the buffer
buf := make([]byte, 64*1024)
scanner.Buffer(buf, bufio.MaxScanTokenSize)
It gets a little better
./speed 2.47s user 0.25s system 104% cpu 2.609 total
I know it can get better because other tools manage to do it under a second without any kind of indexing. What seems to be the bottleneck with this approach?
0.33s user 0.14s system 94% cpu 0.501 total
Upvotes: 6
Views: 10599
Reputation: 5895
H. Ross' answer is awesome, but it reads the whole file into memory, which may not be feasible if your file is too big. If you still want to scan line-by-line, perhaps if you're searching for multiple items, I found that using scanner.Bytes() instead of scanner.Text() improves speed slightly on my machine, from 2.244s for the original question, to 1.608s. bufio's scanner.Bytes() method doesn't allocate any additional memory, whereas Text() creates a string from its buffer.
package main
import (
"bufio"
"fmt"
"os"
"bytes"
)
// uses scanner.Bytes to avoid allocation.
func main() {
f, err := os.Open("./crackstation-human-only.txt")
scanner := bufio.NewScanner(f)
if err != nil {
panic(err)
}
defer f.Close()
toFind := []byte("Iforgotmypassword")
for scanner.Scan() {
if bytes.Contains(scanner.Bytes(), toFind) {
fmt.Println(scanner.Text())
}
}
}
Upvotes: 1
Reputation: 540
LAST EDIT
This is a "line-by-line" solution to the problem that takes trivial time, it prints the entire matching line.
package main
import (
"bytes"
"fmt"
"io/ioutil"
)
func main() {
dat, _ := ioutil.ReadFile("./jumble.txt")
i := bytes.Index(dat, []byte("Iforgotmypassword"))
if i != -1 {
var x int
var y int
for x = i; x > 0; x-- {
if dat[x] == byte('\n') {
break
}
}
for y = i; y < len(dat); y++ {
if dat[y] == byte('\n') {
break
}
}
fmt.Println(string(dat[x : y+1]))
}
}
real 0m0.421s
user 0m0.068s
sys 0m0.352s
ORIGINAL ANSWER
If you just need to see if the string is in a file, why not use regex?
Note: I kept the data as a byte array instead of converting to string.
package main
import (
"fmt"
"io/ioutil"
"regexp"
)
var regex = regexp.MustCompile(`Ilostmypassword`)
func main() {
dat, _ := ioutil.ReadFile("./jumble.txt")
if regex.Match(dat) {
fmt.Println("Yes")
}
}
jumble.txt
is a 859 MB of jumbled text with newlines included.
Running with time ./code
I get:
real 0m0.405s
user 0m0.064s
sys 0m0.340s
To try and answer your comment, I don't think the bottleneck is inherently coming from searching line by line, Golang uses an efficient algorithm for searching strings/runes.
I think the bottleneck comes from the IO reads, when the program reads from the file, it is normally not first in line in the queue of reading, therefore, the program must wait until it can read in order to start actually comparing. Thus, when you are reading in over and over, you are being forced to wait for your turn in IO.
To give you some math, if your buffer size is 64 * 1024 (or 65535 bytes), and your file is 1 GB. Dividing 1 GB / 65535 bytes is 15249 reads needed to check the entire file. Where as in my method, I read the entire file "at once" and check against that constructed array.
Another thing I can think of is just the utter amount of loops needed to move through the file and the time needed for each loop:
Given the following code:
dat, _ := ioutil.ReadFile("./jumble.txt")
sdat := bytes.Split(dat, []byte{'\n'})
for _, l := range sdat {
if bytes.Equal([]byte("Iforgotmypassword"), l) {
fmt.Println("Yes")
}
}
I calculated that each loop takes on average 32 nanoseconds, the string Iforgotmypassword was on line 100000000 in my file, thus the execution time for this loop was roughly 32 nanoseconds * 100000000 ~= 3.2 seconds.
Upvotes: 5
Reputation: 34297
Using my own 700MB test file with your original, time was just over 7 seconds
With grep it was 0.49 seconds
With this program (which doesn't print out the line, it just says yes) 0.082 seconds
package main
import (
"bytes"
"fmt"
"io/ioutil"
"os"
)
func check(e error) {
if e != nil {
panic(e)
}
}
func main() {
find := []byte(os.Args[1])
dat, err := ioutil.ReadFile("crackstation-human-only.txt")
check(err)
if bytes.Contains(dat, find) {
fmt.Print("yes")
}
}
Upvotes: 1