user1131435
user1131435

Reputation:

Why does this program run faster when it's allocated fewer threads?

I have a fairly simple Go program designed to compute random Fibonacci numbers to test some strange behavior I observed in a worker pool I wrote. When I allocate one thread, the program finishes in 1.78s. When I allocate 4, it finishes in 9.88s.

The code is as follows:

var workerWG sync.WaitGroup

func worker(fibNum chan int) {
    for {
        var tgt = <-fibNum
        workerWG.Add(1)
        var a, b float64 = 0, 1
        for i := 0; i < tgt; i++ {
            a, b = a+b, a
        }
        workerWG.Done()
    }
}

func main() {
    rand.Seed(time.Now().UnixNano())
    runtime.GOMAXPROCS(1) // LINE IN QUESTION

    var fibNum = make(chan int)

    for i := 0; i < 4; i++ {
        go worker(fibNum)
    }
    for i := 0; i < 500000; i++ {
        fibNum <- rand.Intn(1000)
    }
    workerWG.Wait()
}

If I replace runtime.GOMAXPROCS(1) with 4, the program takes four times as long to run.

What's going on here? Why does adding more available threads to a worker pool slow the entire pool down?

My personal theory is that it has to do with the processing time of the worker being less than the overhead of thread management, but I'm not sure. My reservation is caused by the following test:

When I replace the worker function with the following code:

for {
    <-fibNum
    time.Sleep(500 * time.Millisecond)
}

both one available thread and four available threads take the same amount of time.

Upvotes: 4

Views: 443

Answers (3)

SnoProblem
SnoProblem

Reputation: 1939

The code written by @thwd is correct and idiomatic Go.

Your code was being serialized due to the atomic nature of sync.WaitGroup. Both workerWG.Add(1) and workerWG.Done() will block until they're able to atomically update the internal counter.

  • Since the workload is between 0 and 1000 recursive calls, the bottleneck of a single core was enough to keep data races on the waitgroup counter to a minimum.
  • On multiple cores, the processor spends a lot of time spinning to fix the collisions of waitgroup calls. Add that to the fact that the waitgroup counter is kept on one core and you now have added communication between cores (taking up even more cycles).

A couple hints for simplifying code:

  • For a small, set number of goroutines, a complete channel (chan struct{} to avoid allocations) is cheaper to use.
  • Use the send channel close as a kill signal for goroutines and have them signal that they've exited (waitgroup or channel). Then, close to complete channel to free them up for the GC.
  • If you need a waitgroup, aggressively minimize the number of calls to it. Those calls must be internally serialized, so extra calls forces added synchronization.

Upvotes: 1

thwd
thwd

Reputation: 24848

I revised your program to look like the following:

package main

import (
    "math/rand"
    "runtime"
    "sync"
    "time"
)

var workerWG sync.WaitGroup

func worker(fibNum chan int) {
    for tgt := range fibNum {
        var a, b float64 = 0, 1
        for i := 0; i < tgt; i++ {
            a, b = a+b, a
        }
    }
    workerWG.Done()
}

func main() {
    rand.Seed(time.Now().UnixNano())
    runtime.GOMAXPROCS(1) // LINE IN QUESTION

    var fibNum = make(chan int)

    for i := 0; i < 4; i++ {
        go worker(fibNum)
        workerWG.Add(1)
    }
    for i := 0; i < 500000; i++ {
        fibNum <- rand.Intn(100000)
    }
    close(fibNum)
    workerWG.Wait()
}
  • I cleaned up the wait group usage.
  • I changed rand.Intn(1000) to rand.Intn(100000)

On my machine that produces:

$ time go run threading.go (GOMAXPROCS=1)

real    0m20.934s
user    0m20.932s
sys 0m0.012s

$ time go run threading.go (GOMAXPROCS=8)

real    0m10.634s
user    0m44.184s
sys 0m1.928s

This means that in your original code, the work performed vs synchronization (channel read/write) was negligible. The slowdown came from having to synchronize across threads instead of one and only perform a very small amount of work inbetween.

In essence, synchronization is expensive compared to calculating fibonacci numbers up to 1000. This is why people tend to discourage micro-benchmarks. Upping that number gives a better perspective. But an even better idea is to benchmark actual work being done i.e. including IO, syscalls, processing, crunching, writing output, formatting, etc.

Edit: As an experiment, I upped the number of workers to 8 with GOMAXPROCS set to 8 and the result was:

$ time go run threading.go 

real    0m4.971s
user    0m35.692s
sys 0m0.044s

Upvotes: 5

Volker
Volker

Reputation: 42431

Your main computation routine in worker does not allow the scheduler to run. Calling the scheduler manually like

    for i := 0; i < tgt; i++ {
        a, b = a+b, a
        if i%300 == 0 {
            runtime.Gosched()
        }
    }

Reduces wall clock by 30% when switching from one to two threads.

Such artificial microbenchmarks are really hard to get right.

Upvotes: -1

Related Questions