Reputation: 11209
The following code calculates the average number of draws to get 50 unique cards from several sets. All what is important is that this problem does not require much RAM and does not share any variable when launched in multi-threading mode. When launched with four more than one thread to perform 400,000 simulations it consistently takes about an extra second than two processes launched together and performing 200,000 simulations. This has been bothering me and I could not find any explanation.
This is the Julia code in epic_draw_multi_thread.jl:
using Random
using Printf
import Base.Threads.@spawn
function pickone(dist)
n = length(dist)
i = 1
r = rand()
while r >= dist[i] && i<n
i+=1
end
return i
end
function init_items(type_dist, unique_elements)
return zeros(Int32, length(type_dist), maximum(unique_elements))
end
function draw(type_dist, unique_elements_dist)
item_type = pickone(type_dist)
item_number = pickone(unique_elements_dist[item_type])
return item_type, item_number
end
function draw_unique(type_dist, unique_elements_dist, items, x)
while sum(items .> 0) < x
item_type, item_number = draw(type_dist, unique_elements_dist)
items[item_type, item_number] += 1
end
return sum(items)
end
function average_for_unique(type_dist, unique_elements_dist, x, n, reset=true)
println(@sprintf("Started average_for_unique on thread %d with n = %d", Threads.threadid(), n))
items = init_items(type_dist, unique_elements)
tot_draws = 0
for i in 1:n
tot_draws += draw_unique(type_dist, unique_elements_dist, items, x)
if reset
items .= 0
else
items[items.>1] -= 1
end
end
println(@sprintf("Completed average_for_unique on thread %d with n = %d", Threads.threadid(), n))
return tot_draws / n
end
function parallel_average_for_unique(type_dist, unique_elements_dist, x, n, reset=true)
println("Started computing...")
t = max(Threads.nthreads() - 1, 1)
m = Int32(round(n / t))
tasks = Array{Task}(undef, t)
@sync for i in 1:t
task = @spawn average_for_unique(type_dist, unique_elements_dist, x, m)
tasks[i] = task
end
sum(fetch(t) for t in tasks) / t
end
type_dist = [0.3, 0.3, 0.2, 0.15, 0.05]
const cum_type_dist = cumsum(type_dist)
unique_elements = [21, 27, 32, 14, 10]
unique_elements_dist = [[1 / unique_elements[j] for i in 1:unique_elements[j]] for j in 1:length(unique_elements)]
const cum_unique_elements_dist = [cumsum(dist) for dist in unique_elements_dist]
str_n = ARGS[1]
n = parse(Int64, str_n)
avg = parallel_average_for_unique(cum_type_dist, cum_unique_elements_dist, 50, n)
print(avg)
This is the command issued at the shell to run on two threads along with the output and timing results:
time julia --threads 3 epic_draw_multi_thread.jl 400000
Started computing...
Started average_for_unique on thread 3 with n = 200000
Started average_for_unique on thread 2 with n = 200000
Completed average_for_unique on thread 2 with n = 200000
Completed average_for_unique on thread 3 with n = 200000
70.44460749999999
real 0m14.347s
user 0m26.959s
sys 0m2.124s
These are the command issued at the shell to run two processes with half the job size each along with the output and timing results:
time julia --threads 1 epic_draw_multi_thread.jl 200000 &
time julia --threads 1 epic_draw_multi_thread.jl 200000 &
Started computing...
Started computing...
Started average_for_unique on thread 1 with n = 200000
Started average_for_unique on thread 1 with n = 200000
Completed average_for_unique on thread 1 with n = 200000
Completed average_for_unique on thread 1 with n = 200000
70.434375
real 0m12.919s
user 0m12.688s
sys 0m0.300s
70.448695
real 0m12.996s
user 0m12.790s
sys 0m0.308s
No matter how many times I repeat the experiment, I always get the multi-threaded mode slower. Notes:
t = max(Threads.nthreads() - 1, 1)
can be changed to t = Threads.nthreads()
to use the exact number of threads available.EDIT on 11/20/2020
Implemented Przemyslaw Szufel recommendations. This is the new code:
using Random
using Printf
import Base.Threads.@spawn
using BenchmarkTools
function pickone(dist, mt)
n = length(dist)
i = 1
r = rand(mt)
while r >= dist[i] && i<n
i+=1
end
return i
end
function init_items(type_dist, unique_elements)
return zeros(Int32, length(type_dist), maximum(unique_elements))
end
function draw(type_dist, unique_elements_dist, mt)
item_type = pickone(type_dist, mt)
item_number = pickone(unique_elements_dist[item_type], mt)
return item_type, item_number
end
function draw_unique(type_dist, unique_elements_dist, items, x, mt)
while sum(items .> 0) < x
item_type, item_number = draw(type_dist, unique_elements_dist, mt)
items[item_type, item_number] += 1
end
return sum(items)
end
function average_for_unique(type_dist, unique_elements_dist, x, n, mt, reset=true)
println(@sprintf("Started average_for_unique on thread %d with n = %d", Threads.threadid(), n))
items = init_items(type_dist, unique_elements)
tot_draws = 0
for i in 1:n
tot_draws += draw_unique(type_dist, unique_elements_dist, items, x, mt)
if reset
items .= 0
else
items[items.>1] -= 1
end
end
println(@sprintf("Completed average_for_unique on thread %d with n = %d", Threads.threadid(), n))
return tot_draws / n
end
function parallel_average_for_unique(type_dist, unique_elements_dist, x, n, reset=true)
println("Started computing...")
t = max(Threads.nthreads() - 1, 1)
mts = MersenneTwister.(1:t)
m = Int32(round(n / t))
tasks = Array{Task}(undef, t)
@sync for i in 1:t
task = @spawn average_for_unique(type_dist, unique_elements_dist, x, m, mts[i])
tasks[i] = task
end
sum(fetch(t) for t in tasks) / t
end
type_dist = [0.3, 0.3, 0.2, 0.15, 0.05]
const cum_type_dist = cumsum(type_dist)
unique_elements = [21, 27, 32, 14, 10]
unique_elements_dist = [[1 / unique_elements[j] for i in 1:unique_elements[j]] for j in 1:length(unique_elements)]
const cum_unique_elements_dist = [cumsum(dist) for dist in unique_elements_dist]
str_n = ARGS[1]
n = parse(Int64, str_n)
avg = @btime parallel_average_for_unique(cum_type_dist, cum_unique_elements_dist, 50, n)
print(avg)
Updated benchmarks:
Threads @btime Linux Time
1 (2 processes) 9.927 s 0m44.871s
2 (1 process) 20.237 s 1m14.156s
3 (1 process) 14.302 s 1m2.114s
Upvotes: 2
Views: 607
Reputation: 42194
There are two problems here:
MersenneTwister
random state for each thread for the best performance (otherwise your random state is shared across all threads and synchronization needs to occur)Currently you are measuring the time of "Julia starting time" + "code compile time" + "runtime". Compilation of a multi-threaded code obviously takes longer than compilation of a single-threaded code. And starting Julia itself also takes a second or two.
You have two option here. The easiest is to use BenchmarkTools
@btime
macro to measure execution times inside the code.
Another option would be to make your code into a package and compile it into a Julia image via PackageCompiler. You will be still however measuring "Julia start time" + "Julia execution time"
The random number state could be created as:
mts = MersenneTwister.(1:Threads.nthreads());
and then used such as rand(mts[Threads.threadid()])
Upvotes: 5