Reputation: 1847
Consider this toy example, where I compute the norms of all the columns of a random matrix M:
julia> M = rand(Float64, 10000, 10000);
julia> @time map(x -> norm(x), M[:,j] for j in 1:size(M)[2]);
0.363795 seconds (166.70 k allocations: 770.086 MiB, 27.78% gc time)
Then I compute the row norms:
julia> @time map(x -> norm(x), M[i,:] for i in 1:size(M)[1]);
1.288872 seconds (176.19 k allocations: 770.232 MiB, 0.37% gc time)
The factor between the two executions is due (I think) to the memory layout of the matrix (column-major). Indeed, the computation of the row norms loops over non-contiguous data, which leads to non-vectorized code and cache misses. I would like to have the same execution time for both norm computations.
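To see the column-major layout concretely, here is a small check on a toy matrix (A is just an illustrative 4×3 example, not the M above): the strides show that column neighbours are adjacent in memory, while row neighbours are a whole column length apart.
julia> A = rand(4, 3);
julia> strides(A)   # distance (in elements) between neighbours along each dimension
(1, 4)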
Is it possible to convert the layout of M to row-major in order to get the same speed when calculating the norms of the rows?
I tried with transpose and permutedims without success; it seems that when using these functions the memory is now row-major (i.e. column-major with respect to the original matrix).
julia> Mt = copy(transpose(M));
julia> @time map(x -> norm(x), Mt[j,:] for j in 1:size(M)[2]);
1.581778 seconds (176.19 k allocations: 770.230 MiB)
julia> Mt = copy(permutedims(M,[2,1]));
julia> @time map(x -> norm(x), Mt[j,:] for j in 1:size(M)[2]);
1.454153 seconds (176.19 k allocations: 770.236 MiB, 9.98% gc time)
I used copy here to try to force the new layout.
How can I force a column-major layout of the transposed matrix, or a row-major layout of the original matrix?
As pointed out by @mcabbott and @przemyslaw-szufel, there was an error in the last code I showed: I computed the norms of the rows of Mt instead of the norms of the columns. The test on the norms of the columns of Mt gives instead:
julia> Mt = transpose(M);
julia> @time map(x -> norm(x), Mt[:,j] for j in 1:size(M)[2]);
1.307777 seconds (204.52 k allocations: 772.032 MiB, 0.45% gc time)
julia> Mt = permutedims(M);
julia> @time map(x -> norm(x), Mt[:,j] for j in 1:size(M)[2]);
0.334047 seconds (166.53 k allocations: 770.079 MiB, 1.42% gc time)
So in the end it seems that permutedims stores its result in column-major order, as would be expected. In fact, Julia arrays are always stored in column-major order; transpose is kind of an exception because it is a lazy row-major view of a column-major matrix.
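A small check of this on a toy matrix (A is just an illustrative example, not the M above) is to look at the types and the parent array:
julia> A = rand(3, 3);
julia> transpose(A) isa Matrix      # lazy Transpose wrapper, no data is moved
false
julia> parent(transpose(A)) === A   # the wrapper still uses A's column-major storage
true
julia> permutedims(A) isa Matrix    # a genuine new Matrix, stored column-major
true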
Upvotes: 4
Views: 981
Reputation: 42264
There are several problems here. Your @time measurements include compilation time, so run @time twice or use BenchmarkTools instead. M is also a non-const global, and every M[:,j] slice allocates a copy; the benchmark below therefore declares M as const and takes views of the columns/rows:
julia> using LinearAlgebra, BenchmarkTools
julia> const M = rand(10000, 10000);
julia> @btime map(norm, @view M[:,j] for j in 1:size(M)[2]);
49.378 ms (2 allocations: 78.20 KiB)
julia> @btime map(norm, @view M[i, :] for i in 1:size(M)[1]);
1.013 s (2 allocations: 78.20 KiB)
Now to the question about data layout. Julia uses a column-major memory layout, hence operations that work on columns will be faster than those that work on rows.
One possible workaround would be to have a transposed copy of M:
const Mᵀ = collect(M')
This requires some time for copying, but afterwards it lets you match the column performance:
julia> @btime map(norm, @view Mᵀ[:,j] for j in 1:size(M)[2]);
48.455 ms (2 allocations: 78.20 KiB)
julia> map(norm, Mᵀ[:,j] for j in 1:size(M)[2]) == map(norm, M[i,:] for i in 1:size(M)[1])
true
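Keep in mind that Mᵀ is an independent copy, so later changes to M are not reflected in it. A small sketch with a toy matrix A (illustrative only, not the const M above):
julia> A = rand(2, 2); Aᵀ = collect(A');
julia> A[1, 2] = 0.0;
julia> Aᵀ[2, 1] == 0.0   # the copy still holds the old value
false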
Upvotes: 3
Reputation: 12664
You are wasting a lot of time on creating copies of each column/row when calculating the norms. Use views instead, or better yet, eachcol/eachrow, which also do not allocate:
julia> M = rand(1000, 1000);
julia> @btime map(norm, $M[:,j] for j in 1:size($M, 2)); # slow and ugly
946.301 μs (1001 allocations: 7.76 MiB)
julia> @btime map(norm, eachcol($M)); # fast and nice
223.199 μs (1 allocation: 7.94 KiB)
julia> @btime norm.(eachcol($M)); # even nicer, but allocates more for some reason.
227.701 μs (3 allocations: 47.08 KiB)
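The same pattern works for the row norms with eachrow; it still walks strided memory, so it will not match the column speed, but it avoids allocating a copy of every row (a sketch, reusing the M defined above):
julia> norm.(eachrow(M)) ≈ map(norm, M[i,:] for i in 1:size(M, 1))
true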
Upvotes: 2