Marvin.Hansen

Reputation: 1638

Golang: How to capture return values of massively parallel benchmark (> 1 million tasks)?

I'm building a parameter optimizer that essentially generates configurations, benchmarks all of them, collects all results, sorts them, and then picks the best performing configuration relative to the benchmark result.

The benchmark by itself works fine, but takes between 50 ms and 2 sec per run depending on the configuration. The crux is, the optimizer generates a very large number of configurations: between 100k at the low end and about 40 million at the high end, with about 1 to 5 million as a normal range. Obviously, the single-threaded version takes forever, and CPU load is actually very low as each task is relatively light.

I already designed the benchmark to play nicely with concurrency: the runner is encapsulated in a separate struct (called Agent) and the benchmark is essentially a pure function that takes all state as a parameter. Each run creates its own state and then runs independently of all the others, with all runs referencing the same shared dataset. The function is shown below.

However, I struggle with dealing with the return value per benchmark. Back in the day, in Scala we used async/await for task parallelism and just let the results roll in. Goroutines, afaik, only work well with functions that have no return value; in practice, channels are the most natural way to fetch a value from a goroutine. And that's the crux I'm mulling over:

Considering that I usually have > 1 million tasks, how do I catch the return values correctly and efficiently?

Related to that, is there actually a very fast parameter optimizer for Go? For Python, I remember Optuna delivering excellent results.

Thank you

func (a *Agent) runOptimization(strategyConfigs []cmdb.Config) (result *bmx.OptimizeResult) {
    scores := make([]bmx.BackTestResult, len(strategyConfigs))

    println("Run all benchmarks")
    for i, config := range strategyConfigs {
        state := newState(&config)
        score := a.runBenchmark(state)
        scores[i] = *score // sort only works on actual values
    }

    println("Sort results")
    sort.Sort(bmx.ByTotalPnL(scores))

    println("Select best config")
    best := scores[len(scores)-1]

    println("Generate best strategy config")
    stratConf := a.getStrategyConfig(best.PatternConfig)

    println("Return optimization results")
    result = &bmx.OptimizeResult{
        Symbol:          best.Symbol,
        StrategyType:    best.StrategyType,
        OptimizedConfig: &stratConf,
        ...
    }
    return result
}
 

Upvotes: 2

Views: 1376

Answers (1)

kostix

Reputation: 55453

There are multiple ways to do this.

A "textbook" one is like:

results := make(chan *score)

for _, config := range strategyConfigs {
    config := config // shadow the loop variable so &config is unique per iteration (needed before Go 1.22)
    state := newState(&config)
    go a.runBenchmark(state, results)
}

scores := make([]*score, len(strategyConfigs))
for i := 0; i < len(strategyConfigs); i++ {
    scores[i] = <-results
}

…and then modify your runBenchmark method so that it returns nothing and instead accepts a second argument of type chan *score.

The snippet rolls like this:

  1. Creates a channel to exchange values of type *score.
  2. Starts as many goroutines running the runBenchmark method of — I suppose — "an agent".
    The method sends a (pointer to a) score object it computed over the channel passed to it and exits.
  3. Another loop performs as many receives from the channel as there were goroutines spawned and places each received value into the resulting slice.

Caveats:

  • This presupposes a is okay with executing its runBenchmark with multiple goroutines running concurrently.

    If it is not okay, you will presumably need to create a separate a to run each separate goroutine.
    Given that your example is not too minimal, it's hard for me to make an educated guess about how hard/simple it would be.
    If you will need help on this, please ask a separate narrow question.

  • If you have, say, hundreds of millions of "strategy configs", this approach would be too simplistic, as all the goroutines are spawned at once, which a) is a waste of resources; b) may even fail if the number is too big.
    A textbook fix is the so-called "fan-out": a single goroutine receives "tasks" over a channel and distributes them onto a pool of worker goroutines whose number is kept below a certain limit at all times.

Another approach is to employ the fact that in Go, each element of an array (and a slice — by extension) is considered a separate variable. This means that it's fine to concurrently update individual elements of the resulting slice from the worker goroutines — as long as the slice is preallocated and never reallocated (manipulated with append, resliced etc) while this process is ongoing.

To demonstrate, let's use "wait groups":

import "sync"

var wg sync.WaitGroup

scores := make([]score, len(strategyConfigs))

wg.Add(len(strategyConfigs))
for i, config := range strategyConfigs {
    config := config // shadow the loop variable so &config is unique per iteration (needed before Go 1.22)
    state := newState(&config)
    go a.runBenchmark(state, &scores[i], &wg)
}

wg.Wait()

The runBenchmark method should be modified to have

defer wg.Done()

as its first statement and to accept two additional arguments — of type *score and *sync.WaitGroup.

Here, runBenchmark is started in a separate goroutine and is passed the address of the element to update with its result and the address of a wait group to signal "task completion" on.

Note that basically the same caveats as in the first case apply.

As you can see, a goroutine indeed does not return any value. This is mostly because, by the time it could, the goroutine which spawned it may be long gone, and there would be nowhere to return that value to.

Hence there are basically two methods to "get data out" of a goroutine:

  • Send that value on a channel (and have some other goroutine receive from that channel).

    This is the bread and butter of Go. I would recommend starting with this approach and working with it until you feel fully comfortable with it.

  • Update some place in memory providing no other goroutine does the same (otherwise you'd have a data race).

    This approach may be faster (and even simpler, to some) in certain cases but the reasoning behind such code might be harder to see.

You might start with this and this to get the basic concepts covered.


All-in-all, a couple of pointers.

Upvotes: 3
