Reputation: 21243
I have the following simple code:
function hamming4(bits1::Integer, bits2::Integer)
return count_ones(bits1 ⊻ bits2)
end
function random_strings2(n, N)
mask = UInt128(1) << n - 1
return [rand(UInt128) & mask for i in 1:N]
end
function find_min(strings, n, N)
minsofar = fill(n, Threads.nthreads())
# minsofar = n
Threads.@threads for i in 1:N
# for i in 1:N
for j in i+1:N
dist = hamming4(strings[i], strings[j])
if dist < minsofar[Threads.threadid()]
minsofar[Threads.threadid()] = dist
end
end
end
return minimum(minsofar)
#return minsofar
end
function ave_min(n, N)
ITER = 10
strings = random_strings2(n, N)
new_min = find_min(strings, n, N)
avesofar = new_min
# print("New min ", new_min, ". New ave ", avesofar, "\n")
total = avesofar
for i in 1:ITER-1
strings = random_strings2(n, N)
new_min = find_min(strings, n, N)
avesofar = avesofar*(i/(i+1)) + new_min/(i+1)
print("Iteration ", i, ". New min ", new_min, ". New ave ", round(avesofar; digits=2), "\n")
end
return avesofar
end
N = 2^16
n = 99
print("Overall average ", ave_min(n, N), "\n")
When I run it on an AMD 8350 in linux the CPU usage is around 430% (instead of close to 800%).
Is it possible to make the parallelisation work more efficiently?
Also, I noticed a new very impressive looking package called LoopVectorization.jl. As I am computing the Hamming distance in what looks like a vectorizable way, is it possible to speed up the code this way too?
Can the code be vectorized using LoopVectorization.jl?
(I am completely new to Julia)
Upvotes: 2
Views: 162
Reputation: 42214
The parallelization of your code seems to be correct.
Most likely you are running it in Atom or other IDE. Atom by default is using only half o cores (more exactly using only physical not logical cores).
Eg.g running in Atom on my machine:
julia> Threads.nthreads()
4
What you need to do is to explicitly set JULIA_NUM_THREADS
Windows command line (still assuming 8 logical cores)
set JULIA_NUM_THREADS=8
Linux command line
export JULIA_NUM_THREADS=8
After doing that your code takes 100% on all my cores.
After discussion - you can get the time down to around 20% of a single threaded time by using Distributed
instead of Threads
since this avoids memory sharing:
The code will look more or less like this:
using Distributed
addprocs(8)
@everywhere function hamming4(bits1::Integer, bits2::Integer)
return count_ones(bits1 ⊻ bits2)
end
function random_strings2(n, N)
mask = UInt128(1) << n - 1
return [rand(UInt128) & mask for i in 1:N]
end
function find_min(strings, n, N)
return @distributed (min) for i in 1:N-1
minimum(hamming4(strings[i], strings[j]) for j in i+1:N)
end
end
### ... the rest of code remains unchanged
Upvotes: 4