toylas

Reputation: 437

Julia Array of arrays: (rows -> columns) performance

Complete Julia newbie here.

Given an array of arrays, I want to combine corresponding elements of each sub-array. Something like this:

 [2, 7, 9]       [2, 3, 2, 7, 3]
 [3, 5, 4]       [7, 5, 7, 9, 5]
 [2, 7, 7]  ->   [9, 4, 7, 1, 1]
 [7, 9, 1]
 [3, 5, 1]

Searching stackoverflow I came across a couple of solutions other than a direct loop or a list comprehension.
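For reference, the direct-loop version (one of the approaches the question sets aside) could be sketched like this; `combine_elements` is a name made up for illustration, not from any library:

```julia
# Direct-loop "transpose" of a vector of equal-length vectors.
# `combine_elements` is a hypothetical name, for illustration only.
function combine_elements(a::Vector{Vector{Int}})
    n = length(a[1])                  # length of each inner vector
    out = [Vector{Int}(undef, length(a)) for _ in 1:n]
    for (j, v) in enumerate(a)        # j-th input vector fills position j
        for i in 1:n
            out[i][j] = v[i]
        end
    end
    return out
end

a = [[2, 7, 9], [3, 5, 4], [2, 7, 7], [7, 9, 1], [3, 5, 1]]
combine_elements(a)  # [[2, 3, 2, 7, 3], [7, 5, 7, 9, 5], [9, 4, 7, 1, 1]]
```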

julia> a=Vector{Int}[rand(1:10,3) for i=1:5]
5-element Array{Array{Int64,1},1}:
 [2, 7, 9]
 [3, 5, 4]
 [2, 7, 7]
 [7, 9, 1]
 [3, 5, 1]

julia> using BenchmarkTools

julia> @btime a2=mapslices( x -> [x], hcat(a...), dims=2)[:]
  6.174 μs (65 allocations: 3.45 KiB)
3-element Array{Array{Int64,1},1}:
 [2, 3, 2, 7, 3]
 [7, 5, 7, 9, 5]
 [9, 4, 7, 1, 1]

julia> @btime a3=[getindex.(a,i) for i=1:length(a[1])]
  948.087 ns (14 allocations: 768 bytes)
3-element Array{Array{Int64,1},1}:
 [2, 3, 2, 7, 3]
 [7, 5, 7, 9, 5]
 [9, 4, 7, 1, 1]

My question is: Why is the second one about six times faster than the first one? Does it have to do with hcat?

Upvotes: 1

Views: 704

Answers (1)

Frames Catherine White

Reputation: 28232

Baseline and Benchmarking correctly

OK, first let's establish the baseline on my computer.

Before we do anything else, we need to make sure we are not benchmarking with global variables. From the BenchmarkTools README:

If the expression you want to benchmark depends on external variables, you should use $ to "interpolate" them into the benchmark expression to avoid the problems of benchmarking with globals. Essentially, any interpolated variable $x or expression $(...) is "pre-computed" before benchmarking begins...

julia> a=Vector{Int}[rand(1:10,3) for i=1:5];

julia> @btime a2=mapslices( x -> [x], hcat($a...), dims=2)[:];
  6.015 μs (65 allocations: 3.45 KiB)

julia> @btime a3=[getindex.($a,i) for i=1:length($a[1])];
  149.228 ns (6 allocations: 544 bytes)

(If I didn't interpolate, I would get roughly the same as you for a3: 999.500 ns (14 allocations: 768 bytes).)

So a3 is not 6 times faster, but actually about 40 times faster.

Why the difference?

Allocations.

Allocations are fairly slow compared to other operations (in all languages). We can see that the a2 code allocates much more than the a3 code.

So let's look at the bits that allocate:

a2

  • [x] allocates a new 1-element array for each slice (each row of the matrix)
  • hcat allocates a new matrix with everything concatenated
  • mapslices allocates an array for each slice it takes out of the matrix
  • mapslices allocates an array to hold the output (interestingly, it does not use a view; I checked)
  • [:] performs a reshaping copy of the output (the alternative would be vec, which returns a reshaped view)
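The difference in that last point is easy to check directly: `vec` shares memory with the matrix, while indexing with `[:]` allocates a copy (a small sketch, not from the answer):

```julia
m = [1 2; 3 4]
v_view = vec(m)   # a reshape of m: shares the same underlying memory
v_copy = m[:]     # getindex with a colon: allocates a fresh copy
m[1, 1] = 99
v_view[1]  # 99: the reshape sees the mutation
v_copy[1]  # 1: the copy does not
```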

a3

  • getindex.(a, i) allocates one array per element of the output (the same as mapslices' internal slicing of the input matrix)
  • [ ... for ... ] allocates an array for the output (the same as mapslices' output)

So we can see that there are a bunch of extra allocations going on in a2 that are not in a3.

What if we only had the allocations associated with the hcat? Since the original question asks whether the difference is because of the hcat, let's look at that.

I define a new benchmark saving into a4. It uses eachslice, which returns a (lazy) generator of views into slices of the matrix, so there are negligible allocations there. To stop it being lazy, we collect it. The final output is an Array of SubArrays (rather than an Array of Arrays), but that is fine: it will work just the same, since SubArray still subtypes AbstractArray.

julia> @btime a4 = collect(eachslice(hcat($a...), dims=1));
  734.320 ns (13 allocations: 704 bytes)

Here our major allocations are the hcat, and the collect, which allocates the output (the same as [ ... for ... ]).

So yes, the hcat has an effect, but it is far from the majority of the difference.

Splatting and reduce(hcat, xs)

Splatting has a cost. It's generally pretty small until you get up to splatting hundreds of items, but because this is a microbenchmark and everything else is so fast, let's see how it goes if we remove it.

Julia has an optimized function for reduce(hcat, xs) for xs being an array of arrays.
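A quick sanity check that the two spellings build the same matrix (assuming, as in the question, that `a` is a vector of equal-length vectors):

```julia
a = [rand(1:10, 3) for i in 1:5]
reduce(hcat, a) == hcat(a...)  # true: same 3×5 matrix, no splatting needed
```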

So let's see how that goes:

julia> @btime a2_s=mapslices(x -> [x], reduce(hcat, $a), dims=2);
  5.278 μs (59 allocations: 3.17 KiB)

julia> @btime a4_s=collect(eachslice(reduce(hcat, $a), dims=1));
  337.656 ns (8 allocations: 528 bytes)

We can see that it makes a difference. But in the case of a2 it's not much, since the hcat is done only once, whereas the slow allocations in x -> [x] and in mapslices copying slices out of the matrix happen many times.

Can we go faster?

Not really. a3 is pretty much the ideal code for this. It allocates nothing that it does not return.
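The only remaining allocations in a3 are for the output itself. If a caller can reuse output buffers across many calls, even those can be avoided; `transpose_into!` below is a hypothetical helper sketched for illustration, not part of the answer:

```julia
# In-place variant: writes into preallocated output vectors,
# so the hot path allocates nothing at all.
# `transpose_into!` is a made-up name, for illustration only.
function transpose_into!(out, a)
    for i in eachindex(out), j in eachindex(a)
        out[i][j] = a[j][i]
    end
    return out
end

a = [[2, 7, 9], [3, 5, 4]]
out = [Vector{Int}(undef, 2) for _ in 1:3]
transpose_into!(out, a)  # [[2, 3], [7, 5], [9, 4]]
```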

Though if we are willing to swap over to using StaticArrays, we can get something truly unreasonably fast.

julia> using StaticArrays

julia> b = @SVector [@SVector [rand(1:10) for ii in 1:3] for i=1:5];

julia> @btime b3=[getindex.($b,i) for i in 1:length($b[1])];
  36.055 ns (1 allocation: 208 bytes)

Static arrays give the compiler a bunch more information: in particular, the size of all the arrays, and the promise that none of them will ever be changed. This means it can:

  • unroll loops
  • bounds check at compile time
  • allocate them on the stack (rather than the heap)
  • probably some other things I have forgotten

This lets the optimizer (both in Julia and in LLVM) go really wild. The whole thing gets compiled down to basically 2 SSE/AVX vectorized move operations per input column (/output row), plus a small amount of fixed overhead.

julia> @code_native (b->[getindex.(b,i) for i in 1:length(b[1])])(b)
    .section    __TEXT,__text,regular,pure_instructions
; ┌ @ REPL[83]:1 within `#161'
    subq    $136, %rsp
    vmovups (%rdi), %ymm0
    vmovups 32(%rdi), %ymm1
    vmovups 64(%rdi), %ymm2
    vmovups 88(%rdi), %ymm3
    vmovups %ymm3, 88(%rsp)
    vmovups %ymm2, 64(%rsp)
    vmovups %ymm1, 32(%rsp)
    vmovups %ymm0, (%rsp)
    movabsq $5152370032, %rax       ## imm = 0x1331AED70
; │┌ @ generator.jl:32 within `Generator' @ generator.jl:32
    vmovaps (%rax), %xmm0
    vmovups %xmm0, 120(%rsp)
; │└
    movabsq $collect, %rax
    movq    %rsp, %rdi
    vzeroupper
    callq   *%rax
    addq    $136, %rsp
    retq
    nop
; └

Upvotes: 4
