Trock

Reputation: 318

How to get faster pairwise distance and vector calculations in Julia 0.5 compared to Matlab2016b?

I am running a program that requires repeated pairwise distance calculations across a collection of points in 2D, followed by vector calculations. This is the bottleneck in my run times, so I tried re-writing my code from Matlab to Julia to take advantage of its faster speeds. The problem I have run into, however, is that my Julia function actually runs about five times slower than my Matlab implementation. Given Julia's reputation as a substantially faster language, I assume I have done something wrong.

I wrote a simple example to illustrate what I am seeing.

Julia code:

using Distances
function foo()
  historyarray = zeros(5000,150,2)
  a = randn(150,2)
  for i in 1:5000
    pd = pairwise(Euclidean(),a.')
    xgv = broadcast(-,a[:,1].',a[:,1])
    ygv = broadcast(-,a[:,2].',a[:,2])
    th = atan2(ygv,xgv)
    fv = 1./(pd+1)
    xfv = fv*cos(th)
    yfv = fv*sin(th)
    a[:,1]+= sum(xfv,2)
    a[:,2]+= sum(yfv,2)

    historyarray[i,:,:] = copy(a)
  end
end

Matlab code:

function foo
histarray = zeros(5000,150,2);
a = randn(150,2);
for i=1:5000

    pd = pdist2(a,a);
    xgv = -bsxfun(@minus,a(:,1),a(:,1)');
    ygv = -bsxfun(@minus,a(:,2),a(:,2)');
    th = atan2(ygv,xgv);
    fv = 1./(pd+1);
    xfv = fv.*cos(th);
    yfv = fv.*sin(th);
    a(:,1) = a(:,1)+sum(xfv,2);
    a(:,2) = a(:,2)+sum(yfv,2);

    histarray(i,:,:)=a;
end

end

When I test the Julia code's speed (multiple times to account for compiling time) I get:

@time foo()
16.110077 seconds (2.65 M allocations: 8.566 GB, 6.30% gc time)

The Matlab function's performance on the other hand is:

tic
foo
toc
Elapsed time is 3.339807 seconds. 

When I run the profile viewer on the Julia code, the components that take the most time are lines 9, 11, and 12 (the atan2, cos, and sin lines). Is there perhaps something strange happening with the trigonometric functions?

Upvotes: 4

Views: 1052

Answers (4)

Dan Getz

Reputation: 18217

The idea behind the get-rid-of-trig refactor is the identity sin(atan2(y,x)) == y/sqrt(x^2+y^2) (and the analogous one for cos). Conveniently, the hypot function calculates the square-root denominator, and inv is used to get away from slow divisions. The code:

# a constant input matrix to allow foo2/foo3 comparison
a = randn(150,2)

# calculation using trig functions
function foo2(b,n)
  a = copy(b)
  historyarray = zeros(n,size(a,1),2)
  pd = zeros(size(a,1), size(a,1))
  xgv = similar(pd)
  ygv = similar(pd)
  th = similar(pd)
  fv = similar(pd)
  xfv = similar(pd)
  yfv = similar(pd)
  tmp = zeros(size(a,1))
  @views for i in 1:n
      pairwise!(pd, Euclidean(),a.')
      xgv .= a[:,1].' .- a[:,1]
      ygv .= a[:,2].' .- a[:,2]
      th .= atan2.(ygv,xgv)
      fv .= 1./(pd.+1)
      xfv .= fv.*cos.(th)
      yfv .= fv.*sin.(th)
      a[:,1:1] .+= sum!(tmp, xfv)
      a[:,2:2] .+= sum!(tmp, yfv)
      historyarray[i,:,:] = a
  end
end

# helper functions to handle the annoying NaNs (0/0) from the self-interaction calc
nantoone(x) = ifelse(isnan(x),1.0,x)
nantozero(x) = ifelse(isnan(x),0.0,x)

# calculations using Pythagoras
function foo3(b,n)
  a = copy(b)
  historyarray = zeros(n,size(a,1),2)
  pd = zeros(size(a,1), size(a,1))
  xgv = similar(pd)
  ygv = similar(pd)
  th = similar(pd)
  fv = similar(pd)
  xfv = similar(pd)
  yfv = similar(pd)
  tmp = zeros(size(a,1))
  @views for i in 1:n
      pairwise!(pd, Euclidean(),a.')
      xgv .= a[:,1].' .- a[:,1]
      ygv .= a[:,2].' .- a[:,2]
      th .= inv.(hypot.(ygv,xgv))
      fv .= inv.(pd.+1)
      xfv .= nantoone.(fv.*xgv.*th)
      yfv .= nantozero.(fv.*ygv.*th)
      a[:,1:1] .+= sum!(tmp, xfv)
      a[:,2:2] .+= sum!(tmp, yfv)
      historyarray[i,:,:] = a
  end
end
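The identity driving foo3 can be sanity-checked on a single point (this check is my addition, not part of the original answer):

```julia
# verify sin(atan2(y,x)) == y/hypot(x,y) and cos(atan2(y,x)) == x/hypot(x,y)
x, y = 3.0, 4.0
@assert isapprox(sin(atan2(y, x)), y / hypot(x, y))
@assert isapprox(cos(atan2(y, x)), x / hypot(x, y))
```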

And a benchmark:

julia> @time foo2(a,5000)
  9.698825 seconds (809.51 k allocations: 67.880 MiB, 0.33% gc time)

julia> @time foo3(a,5000)
  2.207108 seconds (809.51 k allocations: 67.880 MiB, 1.15% gc time)

A >4x improvement.

Another takeaway from this is the convenience of a NaN-to-something function, which could be added to Base (similar to coalesce and nvl from the SQL world).
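For illustration, a generic version of the two helpers above might look like the following sketch (the name nantodefault is hypothetical, not an existing Base function):

```julia
# hypothetical generic NaN-to-something helper, analogous to SQL's coalesce/nvl;
# ifelse evaluates branchlessly, which keeps it cheap inside broadcast loops
nantodefault(x, d) = ifelse(isnan(x), oftype(x, d), x)

@assert nantodefault(0.0 / 0.0, 1.0) == 1.0  # NaN from self-interaction -> default
@assert nantodefault(3.5, 1.0) == 3.5        # ordinary values pass through
```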

Upvotes: 4

Wedg

Reputation: 51

I'm able to get a speedup vs Matlab using multiple threads on Julia 0.5.

On my machine (i5 with 4 cores) I get the following times:
matlab R2012a - 8.5 seconds
julia 0.5 single thread - foo3() (see below) - 18.5 seconds
julia 0.5 multi threads - foo4() (see below) - 4.5 seconds

That is, the single-threaded Julia function runs about twice as slow as Matlab, but the multithreaded version runs about twice as fast as Matlab.

Apologies for this being a very verbose answer - I thought it better to be comprehensive. Below I post each of the inner functions used as well as the main ones, foo3() and foo4().

1. Single Thread:

The aim of the devectorised functions below is to avoid unnecessary memory allocations and to make use of the symmetry of the arrays. From tim's answer it looks like much of this can be handled with single lines in 0.6 using the dot notation.

function pdist2!(pd, a)
    m = size(a, 1)
    for col in 1:m
        for row in (col + 1):m
            s = 0.0
            for i in 1:2
                @inbounds s += abs2(a[col, i] - a[row, i])
            end
            @inbounds pd[row, col] = pd[col, row] = sqrt(s)
        end
    end
end

function dotminustranspose!(xgv, ygv, a)
    m = size(a, 1)
    for col in 1:m
        @inbounds for row in (col + 1):m
            xgv[row, col] = a[col, 1] - a[row, 1]
            ygv[row, col] = a[col, 2] - a[row, 2]
            xgv[col, row] = - xgv[row, col]
            ygv[col, row] = - ygv[row, col]
        end
    end
end

function atan2!(th, ygv, xgv)
    for i in eachindex(ygv)
        @inbounds th[i] = atan2(ygv[i], xgv[i])
    end
end

function invpdp1!(fv, pd)
    for i in eachindex(pd)
        @inbounds fv[i] = 1 / (pd[i] + 1)
    end
end

function fv_times_cos_th!(xfv, fv, th)
    for i in eachindex(th)
        @inbounds xfv[i] = fv[i] * cos(th[i])
    end
end

function fv_times_sin_th!(yfv, fv, th)
    for i in eachindex(th)
        @inbounds yfv[i] = fv[i] * sin(th[i])
    end
end

function adsum2!(a, xfv, yfv)
    n = size(a, 1)
    for j in 1:n
        for i in 1:n
            @inbounds a[i, 1] += xfv[i, j]
            @inbounds a[i, 2] += yfv[i, j]
        end
    end
end

function foo3()
    a = reshape(sin(1:300), 150, 2)
    histarray = zeros(5000, 150, 2)
    pd = zeros(size(a, 1), size(a, 1))
    xgv = zeros(pd)
    ygv = zeros(pd)
    th = zeros(pd)
    fv = zeros(pd)
    xfv = zeros(pd)
    yfv = zeros(pd)
    for i in 1:5000
        pdist2!(pd, a)
        dotminustranspose!(xgv, ygv, a)
        atan2!(th, ygv, xgv)
        invpdp1!(fv, pd)
        fv_times_cos_th!(xfv, fv, th)
        fv_times_sin_th!(yfv, fv, th)
        adsum2!(a, xfv, yfv)

        histarray[i, :, :] = view(a, :)
    end
    return histarray
end

Time:

@time histarray = foo3()
17.966093 seconds (24.51 k allocations: 13.404 MB)

2. Multi Thread:

The elementwise trig functions can be multithreaded using the @threads macro, which gives me about a 4x speedup. Threading is still experimental, but I tested the outputs and they are identical.

using Base.Threads

function atan2_mt!(th, ygv, xgv)
    @threads for i in eachindex(ygv)
        @inbounds th[i] = atan2(ygv[i], xgv[i])
    end
end

function fv_times_cos_th_mt!(xfv, fv, th)
    @threads for i in eachindex(th)
        @inbounds xfv[i] = fv[i] * cos(th[i])
    end
end

function fv_times_sin_th_mt!(yfv, fv, th)
    @threads for i in eachindex(th)
        @inbounds yfv[i] = fv[i] * sin(th[i])
    end
end

function foo4()
    a = reshape(sin(1:300), 150, 2)
    histarray = zeros(5000, 150, 2)
    pd = zeros(size(a, 1), size(a, 1))
    xgv = zeros(pd)
    ygv = zeros(pd)
    th = zeros(pd)
    fv = zeros(pd)
    xfv = zeros(pd)
    yfv = zeros(pd)
    for i in 1:5000
        pdist2!(pd, a)
        dotminustranspose!(xgv, ygv, a)
        atan2_mt!(th, ygv, xgv)
        invpdp1!(fv, pd)
        fv_times_cos_th_mt!(xfv, fv, th)
        fv_times_sin_th_mt!(yfv, fv, th)
        adsum2!(a, xfv, yfv)

        histarray[i, :, :] = view(a, :)
    end
    return histarray
end

Time:

@time histarray = foo4()
4.569416 seconds (54.51 k allocations: 14.320 MB, 0.20% gc time)
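Note (my addition, not part of the original answer): the speedup depends on how many threads Julia was started with, which is fixed at startup via the JULIA_NUM_THREADS environment variable and can be checked with:

```julia
# threads must be set before launch, e.g.: JULIA_NUM_THREADS=4 julia
using Base.Threads
nthreads()  # number of threads available to @threads loops (1 if unset)
```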

Upvotes: 4

tim

Reputation: 2096

You are correct that the calls to sin, cos, and atan2 are the bottleneck in your Julia code. Nevertheless, the high number of allocations implies that there is still potential for optimizations.

On the upcoming version of Julia, your code can easily be rewritten to avoid unnecessary allocations by using the improved dot broadcasting syntax a .= f.(b,c). This is equivalent to broadcast!(f, a, b, c) and updates a in place. Additionally, several dot broadcast calls on the rhs are automatically fused into a single one. Finally, the @views macro turns all slicing operations like a[:,1], which would otherwise allocate a copy, into views. The new code looks like:

function foo2()
    a = rand(150,2)
    historyarray = zeros(5000,150,2)
    pd = zeros(size(a,1), size(a,1))
    xgv = similar(pd)
    ygv = similar(pd)
    th = similar(pd)
    fv = similar(pd)
    xfv = similar(pd)
    yfv = similar(pd)
    tmp = zeros(size(a,1))
    @views for i in 1:5000
        pairwise!(pd, Euclidean(),a.')
        xgv .= a[:,1].' .- a[:,1]
        ygv .= a[:,2].' .- a[:,2]
        th .= atan2.(ygv,xgv)
        fv .= 1./(pd.+1)
        xfv .= fv.*cos.(th)
        yfv .= fv.*sin.(th)
        a[:,1:1] .+= sum!(tmp, xfv)
        a[:,2:2] .+= sum!(tmp, yfv)
        historyarray[i,:,:] = a
    end
end

(I use element-wise multiplication in xfv .= fv.*cos.(th) like in your Matlab code instead of matrix multiplication.)
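As a small illustration of the fusion (my sketch, not from the original answer): every dotted call on the right-hand side collapses into a single loop that writes into the preallocated destination, so no temporary arrays are allocated:

```julia
# a .= sin.(b) .+ cos.(c) fuses into one loop, equivalent to
# broadcast!((x, y) -> sin(x) + cos(y), a, b, c)
a = zeros(3)
b = [1.0, 2.0, 3.0]
c = [4.0, 5.0, 6.0]
a .= sin.(b) .+ cos.(c)

@assert isapprox(a[1], sin(1.0) + cos(4.0))
```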

Benchmarking the new code shows a strong reduction in allocated memory:

julia> @benchmark foo2()
BenchmarkTools.Trial: 
  memory estimate:  67.88 MiB
  allocs estimate:  809507
  --------------
  minimum time:     7.655 s (0.06% GC)
  median time:      7.655 s (0.06% GC)
  mean time:        7.655 s (0.06% GC)
  maximum time:     7.655 s (0.06% GC)
  --------------
  samples:          1
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

(Most of this can be achieved on 0.5, but it requires a bit more typing.)

However, this still takes twice as long as your Matlab version. Profiling reveals that indeed most of the time is spent in the trigonometric functions.

Just for fun I tried:

const atan2 = +
const cos = x->2x
const sin = x->2x

and got:

julia> @benchmark foo2()
BenchmarkTools.Trial: 
  memory estimate:  67.88 MiB
  allocs estimate:  809507
  --------------
  minimum time:     1.020 s (0.69% GC)
  median time:      1.028 s (0.68% GC)
  mean time:        1.043 s (2.10% GC)
  maximum time:     1.100 s (7.75% GC)
  --------------
  samples:          5
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

I guess one reason for the slowness of the trigonometric functions might be that I use the prebuilt Julia binary rather than a self-compiled version of the libm library that Julia uses, so the libm code is not optimized for my processor. But I doubt that this would make Julia much faster than Matlab in this case; the Matlab code already seems close to optimal for this algorithm.

Upvotes: 6

Gawron

Reputation: 54

You can use dot notation to broadcast some of the operations; compare the foo2() function below.

using Distances
function foo1()
  historyarray = zeros(5000,150,2)
  a = randn(150,2)
  for i in 1:5000
    pd = pairwise(Euclidean(),a.')
    xgv = broadcast(-,a[:,1].',a[:,1])
    ygv = broadcast(-,a[:,2].',a[:,2])
    th = atan2(ygv,xgv)
    fv = 1./(pd+1)
    xfv = fv*cos(th)
    yfv = fv*sin(th)
    a[:,1]+= sum(xfv,2)
    a[:,2]+= sum(yfv,2)

    historyarray[i,:,:] = copy(a)
  end
end


function foo2()
  historyarray = zeros(5000,150,2)
  a = randn(150,2)
  for i in 1:5000
    pd = pairwise(Euclidean(),a.')
    xgv = broadcast(-,a[:,1].',a[:,1])
    ygv = broadcast(-,a[:,2].',a[:,2])
    th = atan2.(ygv,xgv)
    fv = 1./(pd+1)
    xfv = fv.*cos.(th)
    yfv = fv.*sin.(th)
    a[:,1]+= sum(xfv,2)
    a[:,2]+= sum(yfv,2)

    historyarray[i,:,:] = copy(a)
  end
end

@time foo1()
@time foo2()

Console output:

29.723805 seconds (2.65 M allocations: 8.566 GB, 1.15% gc time)
16.296859 seconds (2.81 M allocations: 8.571 GB, 2.54% gc time)

Upvotes: 1
