Reputation: 12273
So, I have a piece of code that is concurrent and it's meant to be run onto each CPU/core.
There are two large vectors with input/output values
var (
input = make([]float64, rowCount)
output = make([]float64, rowCount)
)
these are filled and I want to compute the distance (error) between each input-output pair. Being the pairs independent, a possible concurrent version is the following:
var d float64 // Error to be computed
// Setup a worker "for each CPU"
ch := make(chan float64)
nw := runtime.NumCPU()
for w := 0; w < nw; w++ {
go func(id int) {
var wd float64
// eg nw = 4
// worker0, i = 0, 4, 8, 12...
// worker1, i = 1, 5, 9, 13...
// worker2, i = 2, 6, 10, 14...
// worker3, i = 3, 7, 11, 15...
for i := id; i < rowCount; i += nw {
res := compute(input[i])
wd += distance(res, output[i])
}
ch <- wd
}(w)
}
// Compute total distance
for w := 0; w < nw; w++ {
d += <-ch
}
The idea is to have a single worker for each CPU/core, and each worker processes a subset of the rows.
The problem I'm having is that this code is no faster than the serial code.
Now, I'm using Go 1.7 so runtime.GOMAXPROCS
should be already set to runtime.NumCPU()
, but even setting it explicitly does not improves performances.
(a-b)*(a-b)
;math.Pow
and math.Sqrt
functions);So, besides accessing the global data (input/output) for reading, there are no locks/mutexes that I am aware of (not using math/rand
, for example).
I also compiled with -race
and nothing emerged.
My host has 4 virtual cores, but when I run this code I get (using htop) CPU usage to 102%, but I expected something around 380%, as it happened in the past with other go code that used all the cores.
I would like to investigate, but I don't know how the runtime allocates threads and schedule goroutines.
How can I debug this kind of issues? Can pprof
help me in this case? What about the runtime
package?
Thanks in advance
Upvotes: 1
Views: 482
Reputation: 12273
Sorry, but in the end I got the measurement wrong. @JimB was right, and I had a minor leak, but not so much to justify a slowdown of this magnitude.
My expectations were too high: the function I was making concurrent was called only at the beginning of the program, therefore the performance improvement was just minor.
After applying the pattern to other sections of the program, I got the expected results. My mistake in evaluation which section was the most important.
Anyway, I learned a lot of interesting things meanwhile, so thanks a lot to all the people trying to help!
Upvotes: 1