Reputation: 1411
I was doing some performance experimentation in Go with matrix multiplication and ran into some unexpected results.
Version 1:
func newMatrix(n int) [][]int {
m := make([][]int, n)
buf := make([]int, n*n)
for i := range m {
m[i] = buf[i*n : (i+1)*n]
}
return m
}
func mult1(m1, m2, res [][]int) [][]int {
for i := range m1 {
for k := range m1[0] {
for j := range m2[0] {
res[i][j] += m1[i][k] * m2[k][j]
}
}
}
return res
}
From the linear array i create multiple slices that represent the matrix rows.
Version 2:
func mult2(m1, m2, res []int, n int) []int {
for i := 0; i < n; i++ {
for k := 0; k < n; k++ {
for j := 0; j < n; j++ {
res[i*n+j] += m1[i*n+k] * m2[k*n+j]
}
}
}
return res
}
In this version I simply use a linear array and index into it from the multiplication.
Multiplying 2 2048x2048 matrices give the following execution time:
version 1: 35.550813801s
version 2: 19.090223468s
Version 2 is almost twice as fast.
I used the approach below to take the measurements:
start := time.Now()
mult(m1, m2, m3)
stop := time.Now()
I am aware using slices give another layer of indirection which could impact the cache performance, however I didn't expect it would be such a big difference. Unfortunately I haven't found any good tool, that works with Mac, that can analyse cache efficiency in Go, so I can't say for sure if this is what's causing the performance difference.
So I guess I'm asking is this expected behavior or is there something I'm missing?
Software and hardware: Go version 1.4.2 darwin/amd64; OS X 10.10.3; 2 GHz quad-core i7.
Upvotes: 6
Views: 714
Reputation: 5570
The main problem in your version 1 code seems to be indirect addressing. Even though the layout in memory for the matrices in both versions is the same, using indirect addressing can lead to:
So in case of version 1, indirect addressing is the main issue. I also recommend running the 2 codes in multiple iterations to remove the cache warming penalty which might be higher for version 1 because of what I explained above.
Upvotes: 6
Reputation: 31
Unfortunately, I do not have enough reputation to put this in as comment, but in addition to VAndrei's points it is worth noting that two provided examples use for-loop differently. How does first example perform after s/i := range m1/i := 0; i < n; i++/
?
It could also be useful to check how "list mult1" and "list mult2" outputs looks like in pprof. There is great tutorial to get started with Go's pprof very fast: Profiling Go Programs By Russ Cox
Upvotes: -1