Reputation: 437
Complete Julia newbie here.
Given an array of arrays, I want to combine corresponding elements of each sub-array. Something like this:
[2, 7, 9]      [2, 3, 2, 7, 3]
[3, 5, 4]      [7, 5, 7, 9, 5]
[2, 7, 7]  ->  [9, 4, 7, 1, 1]
[7, 9, 1]
[3, 5, 1]
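For reference, the direct loop I would otherwise write is something like this (a rough sketch, assuming all sub-arrays have the same length):

a = [[2, 7, 9], [3, 5, 4], [2, 7, 7], [7, 9, 1], [3, 5, 1]]

# One output vector per position within the sub-arrays.
out = [Vector{Int}(undef, length(a)) for _ in 1:length(a[1])]
for (j, sub) in enumerate(a)
    for i in eachindex(sub)
        out[i][j] = sub[i]
    end
end
out  # [[2, 3, 2, 7, 3], [7, 5, 7, 9, 5], [9, 4, 7, 1, 1]]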
Searching Stack Overflow, I came across a couple of solutions other than a direct loop or a list comprehension.
julia> a=Vector{Int}[rand(1:10,3) for i=1:5]
5-element Array{Array{Int64,1},1}:
[2, 7, 9]
[3, 5, 4]
[2, 7, 7]
[7, 9, 1]
[3, 5, 1]
julia> using BenchmarkTools
julia> @btime a2=mapslices( x -> [x], hcat(a...), dims=2)[:]
6.174 μs (65 allocations: 3.45 KiB)
3-element Array{Array{Int64,1},1}:
[2, 3, 2, 7, 3]
[7, 5, 7, 9, 5]
[9, 4, 7, 1, 1]
julia> @btime a3=[getindex.(a,i) for i=1:length(a[1])]
948.087 ns (14 allocations: 768 bytes)
3-element Array{Array{Int64,1},1}:
[2, 3, 2, 7, 3]
[7, 5, 7, 9, 5]
[9, 4, 7, 1, 1]
My question is: Why is the second one about six times faster than the first one? Does it have to do with hcat?
Upvotes: 1
Views: 704
Reputation: 28232
OK, first let's establish the baseline on my computer.
Before we do anything else, we need to make sure we are not benchmarking with global variables. From the BenchmarkTools README:

    If the expression you want to benchmark depends on external variables, you should use $ to "interpolate" them into the benchmark expression to avoid the problems of benchmarking with globals. Essentially, any interpolated variable $x or expression $(...) is "pre-computed" before benchmarking begins...
julia> a=Vector{Int}[rand(1:10,3) for i=1:5];
julia> @btime a2=mapslices( x -> [x], hcat($a...), dims=2)[:];
6.015 μs (65 allocations: 3.45 KiB)
julia> @btime a3=[getindex.($a,i) for i=1:length($a[1])];
149.228 ns (6 allocations: 544 bytes)
(If I didn't interpolate, I would get roughly the same as you for a3: 999.500 ns (14 allocations: 768 bytes).)
So a3 is not 6 times faster, but actually about 40 times faster (6.015 μs vs 149.228 ns).
Allocations are fairly slow compared to other operations (in all languages), and we can see that the a2 code allocates much more than the a3 code.
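Outside of BenchmarkTools, @allocated gives a quick way to compare the two (a sketch; run each line twice so compilation does not inflate the first number):

bytes_a2(a) = @allocated mapslices(x -> [x], hcat(a...), dims=2)[:]
bytes_a3(a) = @allocated [getindex.(a, i) for i in 1:length(a[1])]
bytes_a2(a), bytes_a3(a)  # a2 allocates several times more bytes than a3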
So let's look at the bits that allocate:

a2:
- [x] allocates a new 1-element array for each column
- hcat allocates a new matrix with everything concatenated
- mapslices allocates an array for each slice it takes out of the matrix
- mapslices allocates an array to hold the output (interesting that it does not use a view, but I checked)
- [:] performs a reshaped copy of the output (the alternative would be vec, which returns a reshaped view; see the sketch after this list)

a3:
- getindex.(a, i) allocates an array for each column of the output (so the same as mapslices' internal slicing of the input matrix)
- [ ... for ... ] allocates an array for the output (so the same as the mapslices output)

So we can see that there are a bunch of extra allocations going on in a2 that are not in a3.
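A minimal sketch of that last point about [:] versus vec, with a small matrix standing in for the hcat result:

m = [1 2; 3 4]

c = m[:]     # reshaped copy: a brand new 4-element Vector
v = vec(m)   # reshaped view: shares memory with m, no copy

v[1] = 99
(m[1, 1], c[1])  # (99, 1): writes through v reach m, while c is untouched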
hcat.
Since the original question asks if it is because of the hcat, let's look at that.
I define a new benchmark saving into a4.
It uses eachslice, which returns a (lazy) generator of views into slices of the matrix, so there are negligible allocations there.
To stop it being lazy, we collect it.
The final output of this is an Array of SubArrays (rather than an Array of Arrays), but that is fine: it will work just the same, since SubArray still subtypes AbstractArray.
julia> @btime a4 = collect(eachslice(hcat($a...), dims=1));
734.320 ns (13 allocations: 704 bytes)
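To see the Array-of-SubArrays point concretely (a sketch, reusing the a from above):

m = reduce(hcat, a)                  # 3×5 Matrix{Int64}
a4 = collect(eachslice(m, dims=1))   # one view per row of m
eltype(a4) <: SubArray               # true: views, not fresh Vectors
a4[1] isa AbstractVector             # true, so generic code is unaffected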
Here our major allocations are:
- hcat
- collect, which allocates the output (the same as [ ... for ... ])
So yes, the hcat has an effect, but it is far from the majority of the difference.
reduce(hcat, xs).
Splatting has a cost. It is generally pretty small until you get up to splatting hundreds of items, but because this is a microbenchmark and everything else is so fast, let's see what removing it does.
Julia has an optimized method of reduce(hcat, xs) for xs being an array of arrays.
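As a quick sanity check (a sketch), it builds the same matrix, just without constructing and splatting an argument list:

reduce(hcat, a) == hcat(a...)  # true for the a defined above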
So let's see how that goes:
julia> @btime a2_s=mapslices(x -> [x], reduce(hcat, $a), dims=2);
5.278 μs (59 allocations: 3.17 KiB)
julia> @btime a4_s=collect(eachslice(reduce(hcat, $a), dims=1));
337.656 ns (8 allocations: 528 bytes)
We can see that it makes a difference.
But in the case of a2 it is not much, since the hcat is done once, whereas the slow allocations in x -> [x] and in mapslices copying slices out of the matrix happen many times.
Can we do better?
Not really. a3 is pretty much the ideal code for this: it allocates nothing that it does not return.
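A rough way to see that (a sketch): the answer itself already consists of 1 + length(a[1]) arrays, so any method returning a fresh array of arrays must allocate at least that many.

a3 = [getindex.(a, i) for i in 1:length(a[1])]
1 + length(a3)  # 4: one outer Vector plus 3 inner Vectors, the unavoidable floor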
Though if we are willing to swap over to using StaticArrays, we can get something truly, unreasonably fast.
julia> using StaticArrays
julia> b = @SVector [@SVector [rand(1:10) for ii in 1:3] for i=1:5];
julia> @btime b3=[getindex.($b,i) for i in 1:length($b[1])];
36.055 ns (1 allocation: 208 bytes)
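That single remaining allocation is the outer Vector: broadcasting getindex over an SVector yields another SVector, which lives on the stack rather than the heap (a sketch):

getindex.(b, 1)   # an SVector{5,Int64}, built entirely on the stack
b3 = [getindex.(b, i) for i in 1:length(b[1])]
typeof(b3)        # Vector of SVector{5,Int64}; only this outer array hits the heap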
Static arrays give the compiler a bunch more information: in particular, the size of all the arrays, and the promise that none of them will ever be changed. This means it can:
- unroll loops
- bounds-check at compile time
- allocate them on the stack (rather than the heap)
- probably some other things I have forgotten
This lets the optimizer (both in Julia and in LLVM) go really wild. The whole thing gets compiled down to basically 2 SSE/AVX vectorized move operations per input column (/output row), plus a small amount of fixed overhead.
julia> @code_native (b->[getindex.(b,i) for i in 1:length(b[1])])(b)
.section __TEXT,__text,regular,pure_instructions
; ┌ @ REPL[83]:1 within `#161'
subq $136, %rsp
vmovups (%rdi), %ymm0
vmovups 32(%rdi), %ymm1
vmovups 64(%rdi), %ymm2
vmovups 88(%rdi), %ymm3
vmovups %ymm3, 88(%rsp)
vmovups %ymm2, 64(%rsp)
vmovups %ymm1, 32(%rsp)
vmovups %ymm0, (%rsp)
movabsq $5152370032, %rax ## imm = 0x1331AED70
; │┌ @ generator.jl:32 within `Generator' @ generator.jl:32
vmovaps (%rax), %xmm0
vmovups %xmm0, 120(%rsp)
; │└
movabsq $collect, %rax
movq %rsp, %rdi
vzeroupper
callq *%rax
addq $136, %rsp
retq
nop
; └
Upvotes: 4