raizik
raizik

Reputation: 19

Reading a random line from a file in constant time in Go

I have the following code to choose 2 random lines from a file containing lines of the form ip:port:

import (
  "os"
   "fmt"
"math/rand"
"log"
"time"
"unicode/utf8"

//"bufio"
)
func main() {
fmt.Println("num bytes in line is: \n", utf8.RuneCountInString("10.244.1.8:8080"))
file_pods_array, err_file_pods_array := os.Open("pods_array.txt")
if err_file_pods_array != nil {
        log.Fatalf("failed opening file: %s", err_file_pods_array)
}
//16 = num of bytes in ip:port pair
randsource := rand.NewSource(time.Now().UnixNano())
                randgenerator := rand.New(randsource)
                firstLoc := randgenerator.Intn(10)
                secondLoc := randgenerator.Intn(10)
                candidate1 := ""
                candidate2 := ""
num_bytes_from_start_first := 16 * (firstLoc + 1)
num_bytes_from_start_second := 16 * (secondLoc + 1)
    buf_ipport_first := make([]byte, int64(15))
    buf_ipport_second := make([]byte, int64(15))
    start_first := int64(num_bytes_from_start_first)
    start_second := int64(num_bytes_from_start_second)
    _, err_first := file_pods_array.ReadAt(buf_ipport_first, start_first)
    first_ipport_ep := buf_ipport_first
    if err_first == nil {
            candidate1 = string(first_ipport_ep)
    }
    _, err_second := file_pods_array.ReadAt(buf_ipport_second, start_second)
    second_ipport_ep := buf_ipport_second

    if err_second == nil {
            candidate2 = string(second_ipport_ep)
    }
fmt.Println("first is: ", candidate1)
fmt.Println("sec is: ", candidate2)
}

This sometimes prints empty or partial lines. Why does this happen and how can I fix it?

Output example:

num bytes in line is:
 15
first is: 10.244.1.17:808
sec is:
10.244.1.11:80

Thank you.

Upvotes: 0

Views: 1846

Answers (2)

Schwern
Schwern

Reputation: 164769

If your lines were of a fixed length you could do this in constant time.

  1. Length of each line is L.
  2. Check the size of the file, S.
  3. Divide S/L to get the number of lines N.
  4. Pick a random number R from 0 to N-1.
  5. Seek to R*L in the file.
  6. Read L bytes.

But you don't have fixed length lines. We can't do constant time, but we can do it in constant memory and O(n) time using the technique from The Art of Computer Programming, Volume 2, Section 3.4.2, by Donald E. Knuth.

  1. Read a line. Remember its line number M.
  2. Pick a random number from 1 to M.
  3. If it's 1, remember this line.

That is, as you read each line you have a 1/M chance of picking it. Cumulatively this adds up to 1/N for every line.

If we have three lines, the first line has a 1/1 chance of being picked. Then a 1/2 chance of remaining. Then a 2/3 chance of remaining. Total chance: 1 * 1/2 * 2/3 = 1/3.

The second line has a 1/2 chance of being picked and a 2/3 chance of remaining. Total chance: 1/2 * 2/3 = 1/3.

The third line has a 1/3 chance of being picked.

package main

import(
    "bufio"
    "fmt"
    "os"
    "log"
    "math/rand"
    "time"
);

func main() {
    file, err := os.Open("pods_array.txt")
    if err != nil {
        log.Fatal(err)
    }
    defer file.Close()

    scanner := bufio.NewScanner(file)

    randsource := rand.NewSource(time.Now().UnixNano())
    randgenerator := rand.New(randsource)

    lineNum := 1
    var pick string
    for scanner.Scan() {
        line := scanner.Text()
        fmt.Printf("Considering %v at 1/%v.\n", scanner.Text(), lineNum)

        // Instead of 1 to N it's 0 to N-1
        roll := randgenerator.Intn(lineNum)        
        fmt.Printf("We rolled a %v.\n", roll)
        if roll == 0 {
            fmt.Printf("Picking line.\n")
            pick = line
        }

        lineNum += 1
    }

    fmt.Printf("Picked: %v\n", pick)
}

Because rand.Intn(n) returns [0,n), that is from 0 to n-1, we check for 0, not 1.


Maybe you're thinking "what if I seek to a random point in the file and then read the next full line?" That wouldn't quite be constant time, it would beO(longest-line), but it wouldn't be truly random. Longer lines would get picked more frequently.


Note that since these are (I assume) all IP addresses and ports you could have constant record lengths. Store the IPv4 address as a 32 bits and the port as a 16 bits. 48 bits per line.

However, this will break on IPv6. For forward compatibility store everything as IPv6: 128 bits for the IP and 16 bits for the port. 144 bits per line. Convert IPv4 addresses to IPv6 for storage.

This will allow you to pick random addresses in constant time, and it will save disk space.

Alternatively, store them in SQLite.

Upvotes: 1

raizik
raizik

Reputation: 19

found a solution using ioutil and strings:

func main() {
randsource := rand.NewSource(time.Now().UnixNano())
                randgenerator := rand.New(randsource)
                firstLoc := randgenerator.Intn(10)
                secondLoc := randgenerator.Intn(10)
                candidate1 := ""
                candidate2 := ""

        dat, err := ioutil.ReadFile("pods_array.txt")
        if err == nil {
                ascii := string(dat)
                splt := strings.Split(ascii, "\n")
candidate1 = splt[firstLoc]
candidate2 = splt[secondLoc]
        }
fmt.Println(candidate1)
fmt.Println(candidate2)
}

Output

10.244.1.3:8080
10.244.1.11:8080

Upvotes: 0

Related Questions