I am interested in calculating the quantity
(x_i - μ_k)' * W_k * (x_i - μ_k)
where x_i is a 1×D vector (one of my N data points of dimension D), μ is a D×K matrix whose k-th column is μ_k, and W is a list of K D×D matrices. For a given i this should result in a 1×K vector (one entry per k). I compute it for all N and K in the following way, which works:
res = zeros(N, K);
for i in 1:N
    for k in 1:K
        res[i,k] = (x_matrix[i,:] - mus_matrix[:,k])' *
                   w_matrix[k] * (x_matrix[i,:] - mus_matrix[:,k])
    end
end
If I try to vectorize it using the following:
res = zeros(N, K);
for i in 1:N
    res[i,:] = (x_matrix[i,:] .- mus_matrix)' .* w_matrix .* (x_matrix[i,:] .- mus_matrix)
end
I get the following error:
ERROR: DimensionMismatch("arrays could not be broadcast to a common size")
Stacktrace:
[1] _bcs1(::Base.OneTo{Int64}, ::Base.OneTo{Int64}) at ./broadcast.jl:70
[2] _bcs at ./broadcast.jl:63 [inlined]
[3] broadcast_shape(::Tuple{Base.OneTo{Int64},Base.OneTo{Int64}}, ::Tuple{Base.OneTo{Int64}}, ::Tuple{Base.OneTo{Int64},Base.OneTo{Int64}}, ::Vararg{Tuple{Base.OneTo{Int64},Base.OneTo{Int64}},N} where N) at ./broadcast.jl:57 (repeats 3 times)
[4] broadcast_indices(::Array{Float64,2}, ::Array{Any,1}, ::Array{Float64,1}, ::Vararg{Any,N} where N) at ./broadcast.jl:53
[5] broadcast_c(::Function, ::Type{Array}, ::Array{Float64,2}, ::Array{Any,1}, ::Vararg{Any,N} where N) at ./broadcast.jl:311
[6] broadcast(::Function, ::Array{Float64,2}, ::Array{Any,1}, ::Array{Float64,1}, ::Vararg{Any,N} where N) at ./broadcast.jl:434
Here is an example:
julia> N = 5
5
julia> D=2
2
julia> K = 4
4
julia> W=[]
0-element Array{Any,1}
julia> x = rand(N,D)
5×2 Array{Float64,2}:
0.576477 0.9575
0.184454 0.660436
0.470267 0.729649
0.648879 0.782561
0.626453 0.111332
julia> mu = rand(K,D)
4×2 Array{Float64,2}:
0.989281 0.00126782
0.659106 0.66136
0.50843 0.289442
0.327962 0.523229
julia> for i in 1:K
push!(W,rand(D,D))
end
And then run
julia> (x_matrix[i,:]-mus_matrix[:,k])'*
w_matrix[k]*(x_matrix[i,:]-mus_matrix[:,k])
34649.850360744866
But with the second (vectorized) code:
julia> (x_matrix[i,:].-mus_matrix)'.*w_matrix.*(x_matrix[i,:].-mus_matrix)
ERROR: DimensionMismatch("arrays could not be broadcast to a common size")
Stacktrace:
[1] _bcs1(::Base.OneTo{Int64}, ::Base.OneTo{Int64}) at ./broadcast.jl:70
[2] _bcs at ./broadcast.jl:63 [inlined]
[3] broadcast_shape(::Tuple{Base.OneTo{Int64},Base.OneTo{Int64}}, ::Tuple{Base.OneTo{Int64}}, ::Tuple{Base.OneTo{Int64},Base.OneTo{Int64}}, ::Vararg{Tuple{Base.OneTo{Int64},Base.OneTo{Int64}},N} where N) at ./broadcast.jl:57 (repeats 3 times)
[4] broadcast_indices(::Array{Float64,2}, ::Array{Any,1}, ::Array{Float64,1}, ::Vararg{Any,N} where N) at ./broadcast.jl:53
[5] broadcast_c(::Function, ::Type{Array}, ::Array{Float64,2}, ::Array{Any,1}, ::Vararg{Any,N} where N) at ./broadcast.jl:311
[6] broadcast(::Function, ::Array{Float64,2}, ::Array{Any,1}, ::Array{Float64,1}, ::Vararg{Any,N} where N) at ./broadcast.jl:434
TL;DR: the optimized variant is at the bottom, but the Einsum version looks nicer, IMHO.
This looks like a case for Einstein summation notation. In Julia, Einsum.jl can do this:
julia> N = 5
5
julia> D = 3
3
julia> K = 10
10
julia> x = rand(N, D)
5×3 Array{Float64,2}:
0.587436 0.210529 0.261725
0.527269 0.457477 0.482939
0.52726 0.411209 0.138872
0.89107 0.464789 0.758392
0.885267 0.931014 0.672959
julia> μ = rand(D, K)
3×10 Array{Float64,2}:
0.280792 0.265066 0.81437 0.503377 0.0717916 … 0.275872 0.609961 0.0820088 0.0042564
0.0177643 0.0959438 0.563948 0.332433 0.088527 0.691971 0.0296638 0.604488 0.956057
0.668128 0.444816 0.74203 0.518232 0.48689 0.465067 0.117469 0.729514 0.109973
julia> W = rand(K, D, D)
10×3×3 Array{Float64,3}:
[:, :, 1] =
0.320861 0.662103 0.219234
0.780944 0.769377 0.566203
0.466207 0.428527 0.330901
0.15534 0.035435 0.346737
0.810676 0.328116 0.469505
0.676575 0.668204 0.285334
0.455551 0.211295 0.85295
0.229995 0.741487 0.783361
0.0937583 0.401419 0.47032
0.956335 0.434213 0.967791
[:, :, 2] =
0.275903 0.130298 0.184485
0.941648 0.940107 0.439454
0.425292 0.252654 0.797115
0.0203406 0.594075 0.484809
0.164309 0.941597 0.455314
0.73628 0.109502 0.920664
0.906305 0.177235 0.540193
0.360038 0.0486971 0.20626
0.914357 0.699901 0.295872
0.284143 0.659117 0.291479
[:, :, 3] =
0.138311 0.921371 0.353719
0.345247 0.70865 0.246736
0.361364 0.636543 0.343837
0.752149 0.581561 0.346399
0.705888 0.24765 0.703952
0.992327 0.369668 0.109407
0.341624 0.223715 0.970667
0.762169 0.94248 0.917569
0.0367128 0.589345 0.121106
0.826602 0.692111 0.229499
julia> using Einsum
julia> @einsum r[n,k] := (x[n,i] - μ[i,k]) * W[k,i,j] * (x[n,j] - μ[j,k])
julia> r
5×10 Array{Float64,2}:
0.0176889 0.087092 0.522184 0.0417967 … -0.0430999 0.041266 -0.0596579 0.432076
0.0521066 0.364059 0.181515 0.00434307 -0.0248712 0.226976 -0.0686294 0.437169
-0.0472136 0.127803 0.458812 0.0119074 0.0391649 -0.0190299 -0.0585371 0.264379
0.468634 1.16498 -0.00263205 0.192809 0.273537 1.13787 -0.0653081 1.41321
0.749655 2.20266 0.0205068 0.420249 0.573358 1.42499 0.441232 1.67574
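As a quick sanity check (a sketch using the variables defined above), the Einsum result should match the naive per-element quadratic form:
# Each entry is (x_n - μ_k)' * W_k * (x_n - μ_k), computed explicitly.
r_loop = [(x[n, :] .- μ[:, k])' * W[k, :, :] * (x[n, :] .- μ[:, k]) for n in 1:N, k in 1:K]
r_loop ≈ r   # expected to be true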
The @einsum call above expands (via @macroexpand) to essentially the following loops (plus preparation and bounds checking):
begin
    local k
    for k = 1:size(μ, 2)
        begin
            local n
            for n = 1:size(x, 1)
                begin
                    local s = zero(T)
                    begin
                        local j
                        for j = 1:size(W, 3)
                            begin
                                local i
                                for i = 1:size(x, 2)
                                    s += (x[n, i] - μ[i, k]) * W[k, i, j] * (x[n, j] - μ[j, k])
                                end
                            end
                        end
                    end
                    r[n, k] = s
                end
            end
        end
    end
end
Now, to find something more performant, I compared a couple of variants using BenchmarkTools.jl. You can see the full code and the results on my laptop here. It shows that the Einsum variant is already better than the original:
# Original:
# memory estimate: 1017.73 MiB
# allocs estimate: 3429967
# median time: 361.982 ms (15.94% GC)
# Einsum:
# memory estimate: 2.64 MiB
# allocs estimate: 76
# median time: 127.536 ms (0.00% GC)
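For reference, a comparison along these lines might be set up as follows (a sketch only; the wrapper name test_einsum is mine, not taken from the linked code):
using BenchmarkTools, Einsum
# Hypothetical wrapper so allocation plus computation is timed together.
function test_einsum(x, μ, W)
    @einsum r[n, k] := (x[n, i] - μ[i, k]) * W[k, i, j] * (x[n, j] - μ[j, k])
    return r
end
@benchmark test_einsum($x, $μ, $W)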
By far the most efficient and least allocating variant is the following, which requires x = x' and W = permutedims(W, [2, 3, 1]) (assuming you can change your representation easily):
function test_optimized!(res, x, μ, W)
    # Expects x in D×N layout (i.e. x = x') and W as D×D×K (permutedims(W, [2, 3, 1])).
    z = zero(eltype(x))
    for k = 1:size(μ, 2)
        for n = 1:size(x, 2)
            res[n, k] = z
            for i = 1:size(W, 1)
                for j = 1:size(W, 2)
                    @inbounds res[n, k] += (x[i, n] - μ[i, k]) * W[i, j, k] * (x[j, n] - μ[j, k])
                end
            end
        end
    end
end

function test_optimized(x, μ, W)
    res = zeros(size(x, 2), size(μ, 2))   # N×K
    test_optimized!(res, x, μ, W)
    res
end
This brings us down to
# memory estimate: 2.63 MiB
# allocs estimate: 2
# median time: 521.215 μs (0.00% GC)
It uses a couple of "tricks" that can be found in the docs: filling a preallocated matrix in a separate method, accessing strides in column-major order, and using @inbounds (although that only improves things on the order of a microsecond).
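If the result buffer is reused across repeated calls, only the initial zeros allocates at all; a sketch, assuming the xt and Wt defined above:
# Reuse one preallocated buffer across many calls to the in-place method.
buf = zeros(size(xt, 2), size(μ, 2))   # N×K
for trial in 1:100
    test_optimized!(buf, xt, μ, Wt)
end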
There is also TensorOperations.jl, which I think does more intelligent things under the hood, but it fails on this:
julia> @tensor r[n,k] := (x[n,i] - μ[i,k]) * W[k,i,j] * (x[n,j] - μ[j,k])
ERROR: TensorOperations.IndexError{String}("invalid index specification: (:n, :i) to (:i, :k)")
Stacktrace:
[1] add_indices(::Tuple{Symbol,Symbol}, ::Tuple{Symbol,Symbol}) at /home/philipp/.julia/v0.6/TensorOperations/src/implementation/indices.jl:22
[2] + at /home/philipp/.julia/v0.6/TensorOperations/src/indexnotation/sum.jl:40 [inlined]
[3] -(::TensorOperations.IndexedObject{(:n, :i),:N,Array{Float64,2},Int64}, ::TensorOperations.IndexedObject{(:i, :k),:N,Array{Float64,2},Int64}) at /home/philipp/.julia/v0.6/TensorOperations/src/indexnotation/sum.jl:44
I guess that's deliberate and has to do with efficiency; see this issue.