rcpinto
rcpinto

Reputation: 4298

Pairwise weighted distance vectorization

The following efficient and vectorized Matlab code computes the Weighted Euclidean Distance between 2 sets of points A and B using a weight vector WTS (1 weight for each dimension; same weights for all points):

    WTS = sqrt(WTS); 

    % modify A and B against weight values
    A = WTS(ones(1,size(A,1)),:).*A;
    B = WTS(ones(1,size(B,1)),:).*B; 

    % calculate distance
    AA = sum(A.*A,2);  
    BB = sum(B.*B,2)'; 
    D = sqrt(AA(:,ones(1,size(B,1))) + BB(ones(1,size(A,1)),:) - 2*A*B'); 

(source: https://github.com/nolanbconaway/pairdist/blob/master/pairdist.m)

My question is: is there an efficient vectorized form (Matlab, R or Julia are fine) for a similar computation with the difference that WTS is a set of weight vectors with the same size as A? In other words, instead of 1 weight vector, I need 1 weight vector for each point in A.

This answer seems to do what I need, but it is in Python and I'm not sure about how to convert it to Matlab/R/Julia: https://stackoverflow.com/a/19285289/834518

Also, not a duplicate of Efficiently calculating weighted distance in MATLAB, as that question deals with the single weight vector case and I'm explicitly asking for the N weight vectors case.

EDIT: example aplications: RBF Networks and Gaussian Mixture Models, where you (can) have 1 weight vector for each neuron/component. An efficient solution to the problem is essential for those kinds of problems.

Upvotes: 3

Views: 963

Answers (4)

rahnema1
rahnema1

Reputation: 15837

Here is a vectorized version in MATLAB(R2016b and later versions):

W2 = 1./W.^2;
D = sqrt(sum((A./W).^2 ,2) - 2 * (A .* W2) * B.' +W2 * (B.^2).');

In pre R2016b versions You can use this:

W2 = 1./W.^2;
D = sqrt(bsxfun(@plus,sum((A./W).^2 ,2) , -2 * (A .* W2) * B.' +W2 * (B.^2).'));

Translation of MATLAB to julia:

W2 = 1./W.^2;
z=sqrt.(broadcast(+,sum((A./W).^2 ,2) , -2 * (A .* W2) * B.' .+W2 * (B.^2).'));

Here my proposed method,Vectorization, is compared to the Loop method provided by @DanGetz. Other solutions are not applicable here.

Distance computation comparison

We can see that for dimensions less than 128 the loop version is faster than the vectorized version. Performance of the loop version would get worse as number of dimensions increases.

The following code used to produce the figure:

function pdist_vectorized (A::Matrix, B::Matrix, W::Matrix)
    W2 = 1./W.^2;
    return sqrt.(broadcast(+,sum((A./W).^2 ,2) , -2 * (A .* W2) * B.' .+W2 * (B.^2).'));
end

result = zeros(10,2);
for i = 1:10
    A = rand( 3000, 2^i);
    B = rand( 2000, 2^i);
    W = ones(size(A));
    result[i,1]=(@timed pdist_1alloc(A,B,W))[2];
    result[i,2]=(@timed pdist_vectorized(A,B,W))[2];
end

using Plots
pyplot()
plot(2.^(1:10), result, title="Pairwise Weighted Distance",
    label=["Loop" "Vectorization"], lw=3,
    xlabel = "Dimension", ylabel = "Time Elapsed(seconds)")

Upvotes: 3

Dan Getz
Dan Getz

Reputation: 18217

Another version optimized to allocate the result matrix and nothing else:

function pdist_1alloc(A::Matrix, B::Matrix, W::Matrix)
    LA, LD = size(A) ; LB = size(B,1)
    res = zeros(LB, LA)
    indA = 0 ; indB = 0 ; indres = 0
    @inbounds for i=1:LD
        for j=1:LA
            a = A[indA+j] ; w = W[indA+j] ; a2w = a^2*w ; awtmp = -2.0*a*w
            for k=1:LB
                indres += 1
                b = B[indB+k] ; b2w = b^2*w
                res[indres] += a2w+awtmp*b+b2w
            end
        end
        indA += LA ; indB += LB ; indres = 0
    end
    res .= sqrt.(res)
    return res
end

It is about 2x faster than @rahnema1's version, and uses the same tricks, but not as readable. In addition, I apologize for misunderstanding the exact setup of the question in the first place (and suggesting Distance.jl which is not directly applicable here).

Upvotes: 2

juliohm
juliohm

Reputation: 3779

As an additional information for future readers, the package Distances.jl has efficient implementations of most distances you can think of. As a general advise, if an operation is very common in scientific computing, there will be a package implementing it well.

using Distances

D = pairwise(WeightedEuclidean(weights), A, B)

Upvotes: 3

Chris Rackauckas
Chris Rackauckas

Reputation: 19132

In Julia you don't have to vectorize it to be efficient, just write the loop and it'll be faster than these vectorized forms anyways because it can fuse and get rid of the temporaries. Here's a pretty efficient implementation of pairwise applies in Julia that you can work from. It has all of the bells and whistles, but you can pair it down if you want.

Note that vectorization isn't necessarily "fast", it's just faster than looping in R/Python/MATLAB because it's only doing a single function call into an optimized kernel written in a lower level language (C/C++) which is actually looping. But putting together vectorized functions usually has a lot of temporary allocations since each vectorized functions return arrays. Thus if you really need efficiency, you should avoid vectorizing in general and write it in a language that allows for low cost function calls / loops. This post explains more about issues with vectorization in high level languages.

That answers one of the three questions you have. I don't have a good answer for MATLAB or R.

Upvotes: 5

Related Questions