Simd

Reputation: 21243

How to get near optimal parallel efficiency for this simple Julia code?

I have the following simple code:

function hamming4(bits1::Integer, bits2::Integer)
    return count_ones(bits1 ⊻ bits2)
end

function random_strings2(n, N)
    mask = UInt128(1) << n - 1
    return [rand(UInt128) & mask for i in 1:N]
end

function find_min(strings, n, N)
    minsofar = fill(n, Threads.nthreads())
    # minsofar = n
    Threads.@threads for i in 1:N 
    # for i in 1:N
        for j in i+1:N
            dist = hamming4(strings[i], strings[j])
            if dist < minsofar[Threads.threadid()]
                minsofar[Threads.threadid()] = dist
            end
        end
    end
    return minimum(minsofar)
    #return minsofar
end
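One possible source of overhead in the loop above (an assumption, not something established in the thread): every thread writes into neighbouring slots of the shared minsofar array, which can cause false sharing between cores. A minimal sketch of a variant that keeps the running minimum in a thread-local variable and writes the array only once per outer iteration (find_min_local is a hypothetical name, not from the original code):

```julia
# Repeated from the question, so this sketch is self-contained.
function hamming4(bits1::Integer, bits2::Integer)
    return count_ones(bits1 ⊻ bits2)
end

function find_min_local(strings, n, N)
    minsofar = fill(n, Threads.nthreads())
    Threads.@threads for i in 1:N
        tid = Threads.threadid()
        # Keep the minimum in a local variable instead of repeatedly
        # writing to the shared array from the inner loop.
        local_min = minsofar[tid]
        for j in i+1:N
            dist = hamming4(strings[i], strings[j])
            if dist < local_min
                local_min = dist
            end
        end
        minsofar[tid] = local_min
    end
    return minimum(minsofar)
end
```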


function ave_min(n, N)
    ITER = 10
    strings = random_strings2(n, N)
    new_min = find_min(strings, n, N)
    avesofar = new_min
    # print("New min ", new_min, ". New ave ", avesofar, "\n")
    total = avesofar
    for i in 1:ITER-1
        strings = random_strings2(n, N)
        new_min = find_min(strings, n, N)
        avesofar = avesofar*(i/(i+1)) + new_min/(i+1)
        print("Iteration ", i, ". New min ", new_min, ". New ave ", round(avesofar; digits=2), "\n")
    end
    return avesofar
end

N = 2^16
n = 99

print("Overall average ", ave_min(n, N), "\n")

When I run it on an AMD 8350 under Linux, the CPU usage is around 430% (instead of close to 800%).

Is it possible to make the parallelisation work more efficiently?

Also, I noticed a very impressive-looking new package called LoopVectorization.jl. As I am computing the Hamming distance in what looks like a vectorizable way, is it possible to speed up the code this way too?

Can the code be vectorized using LoopVectorization.jl?

(I am completely new to Julia)

Upvotes: 2

Views: 162

Answers (1)

Przemyslaw Szufel
Przemyslaw Szufel

Reputation: 42214

The parallelization of your code seems to be correct.

Most likely you are running it in Atom or another IDE. By default Atom uses only half of the cores (more precisely, only the physical cores, not the logical ones).

E.g. running in Atom on my machine:

julia> Threads.nthreads()
4

What you need to do is explicitly set JULIA_NUM_THREADS.

Windows command line (still assuming 8 logical cores)

set JULIA_NUM_THREADS=8

Linux command line

export JULIA_NUM_THREADS=8
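Alternatively, on Julia 1.5 and later the thread count can be passed directly on the command line, and the setting can be verified in one step (a quick sketch, assuming the julia binary is on PATH):

```shell
# Launch Julia with 8 threads and print the active thread count.
# Requires Julia 1.5+ for the --threads flag.
julia --threads 8 -e 'println(Threads.nthreads())'
```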

After doing that, your code uses 100% of all my cores.

EDIT

After discussion: you can get the run time down to around 20% of the single-threaded time by using Distributed instead of Threads, since this avoids memory sharing.

The code will look more or less like this:

using Distributed
addprocs(8)

@everywhere function hamming4(bits1::Integer, bits2::Integer)
    return count_ones(bits1 ⊻ bits2)
end

function random_strings2(n, N)
    mask = UInt128(1) << n - 1
    return [rand(UInt128) & mask for i in 1:N]
end

function find_min(strings, n, N)
    # Each outer iteration computes the minimum distance from strings[i]
    # to all later strings; the (min) reducer combines these partial
    # results across the worker processes.
    return @distributed (min) for i in 1:N-1
        minimum(hamming4(strings[i], strings[j]) for j in i+1:N)
    end
end

### ... the rest of the code remains unchanged
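For readers new to Distributed, a minimal standalone illustration of the @distributed reducer pattern used above (a toy loop of my own, not from the answer): each iteration's last expression becomes that iteration's value, and the reducer folds them together across workers.

```julia
using Distributed
addprocs(2)  # two worker processes for the demo

# The (min) reducer combines the value of each loop body with `min`,
# so the loop returns the global minimum across all workers.
result = @distributed (min) for i in 1:100
    mod(i, 7)  # toy body; mod(7, 7) == 0, so the reduction yields 0
end
println(result)  # prints 0
```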

Upvotes: 4
