Carlj901

Reputation: 1411

Go: Unexpected performance when accessing an array through slice of slices (2D slice)

I was doing some performance experimentation in Go with matrix multiplication and ran into some unexpected results.

Version 1:

func newMatrix(n int) [][]int {
    m := make([][]int, n)
    buf := make([]int, n*n)

    for i := range m {
        m[i] = buf[i*n : (i+1)*n]
    }

    return m
}

func mult1(m1, m2, res [][]int) [][]int {
    for i := range m1 {
        for k := range m1[0] {
            for j := range m2[0] {
                res[i][j] += m1[i][k] * m2[k][j]
            }
        }
    }

    return res
}

From the linear array I create multiple slices that represent the matrix rows.

Version 2:

func mult2(m1, m2, res []int, n int) []int {
    for i := 0; i < n; i++ {
        for k := 0; k < n; k++ {
            for j := 0; j < n; j++ {
                res[i*n+j] += m1[i*n+k] * m2[k*n+j]
            }
        }
    }

    return res
}

In this version I simply use a linear array and index into it directly in the multiplication.
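As a usage sketch, the flat layout can be built and multiplied like this (the helper name newMatrixFlat and the 2x2 values are my own, for illustration; mult2 is copied from the question):

```go
package main

import "fmt"

// newMatrixFlat allocates an n x n matrix as a single row-major slice,
// matching the backing buffer that newMatrix builds in version 1.
func newMatrixFlat(n int) []int {
	return make([]int, n*n)
}

// mult2 as in the question (version 2).
func mult2(m1, m2, res []int, n int) []int {
	for i := 0; i < n; i++ {
		for k := 0; k < n; k++ {
			for j := 0; j < n; j++ {
				res[i*n+j] += m1[i*n+k] * m2[k*n+j]
			}
		}
	}
	return res
}

func main() {
	n := 2
	m1 := []int{1, 2, 3, 4} // [[1 2] [3 4]] in row-major order
	m2 := []int{5, 6, 7, 8} // [[5 6] [7 8]]
	res := mult2(m1, m2, newMatrixFlat(n), n)
	fmt.Println(res) // [19 22 43 50]
}
```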

Multiplying two 2048x2048 matrices gives the following execution times:

 version 1: 35.550813801s
 version 2: 19.090223468s

Version 2 is almost twice as fast.

I used the approach below to take the measurements:

start := time.Now()
mult(m1, m2, m3)
stop := time.Now()

I am aware that using slices gives another layer of indirection, which could impact cache performance; however, I didn't expect such a big difference. Unfortunately I haven't found a good tool that works on a Mac and can analyse cache efficiency in Go, so I can't say for sure whether this is what's causing the performance difference.

So I guess I'm asking: is this expected behavior, or is there something I'm missing?

Software and hardware: Go version 1.4.2 darwin/amd64; OS X 10.10.3; 2 GHz quad-core i7.

Upvotes: 6

Views: 714

Answers (2)

VAndrei

Reputation: 5570

The main problem in your version 1 code seems to be indirect addressing. Even though the layout in memory for the matrices in both versions is the same, using indirect addressing can lead to:

  • More generated instructions for the same code. The compiler may have trouble determining when to use packed versions of SIMD instructions (e.g. SSE, AVX). You can verify this by dumping the assembly code: look for XMM or YMM registers and check whether the instructions operating on those registers are packed.
  • You make it harder for the compiler to insert software prefetches. Because of the indirect addressing, it's difficult for the compiler to determine where to add them. You can look for vprefetch instructions in the assembly code.
  • The hardware prefetcher will also be less efficient due to indirect addressing. You first need to load the row's start address and only then access its elements, so it's harder for the prefetcher to recognize that it should simply fetch consecutive addresses. That's only measurable through profiling, e.g. with perf.

So in the case of version 1, indirect addressing is the main issue. I also recommend running the two versions for multiple iterations to remove the cache-warming penalty, which might be higher for version 1 because of what I explained above.
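The multiple-iterations suggestion can be sketched like this; the first iteration pays the cache-warming cost, so discard it or report it separately (the loop count and matrix size are illustrative; mult2 is copied from the question):

```go
package main

import (
	"fmt"
	"time"
)

// mult2 as in the question (version 2).
func mult2(m1, m2, res []int, n int) []int {
	for i := 0; i < n; i++ {
		for k := 0; k < n; k++ {
			for j := 0; j < n; j++ {
				res[i*n+j] += m1[i*n+k] * m2[k*n+j]
			}
		}
	}
	return res
}

func main() {
	n := 128 // illustrative; the question uses 2048
	m1 := make([]int, n*n)
	m2 := make([]int, n*n)
	for iter := 0; iter < 3; iter++ {
		res := make([]int, n*n)
		start := time.Now()
		mult2(m1, m2, res, n)
		// Iteration 0 includes the cache-warming penalty.
		fmt.Printf("iteration %d: %v\n", iter, time.Since(start))
	}
}
```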

Upvotes: 6

Zaript

Reputation: 31

Unfortunately, I do not have enough reputation to post this as a comment, but in addition to VAndrei's points it is worth noting that the two provided examples use the for loop differently. How does the first example perform after s/i := range m1/i := 0; i < n; i++/ ?

It could also be useful to check what the "list mult1" and "list mult2" outputs look like in pprof. There is a great tutorial for getting started with Go's pprof quickly: Profiling Go Programs by Russ Cox.

Upvotes: -1
