Reputation: 7347
I know that questions about multi-threading performance in Julia have already been asked (e.g. here), but they involve fairly complex code in which many things could be at play.
Here, I am running a very simple loop on multiple threads using Julia v1.5.3, and the speedup doesn't scale very well compared to running the same loop in, for instance, Chapel.
I would like to know what I am doing wrong and how I could run multi-threading in Julia more efficiently.
using BenchmarkTools

function slow(n::Int, digits::String)
    total = 0.0
    for i in 1:n
        if !occursin(digits, string(i))
            total += 1.0 / i
        end
    end
    println("total = ", total)
end
@btime slow(Int64(1e8), "9")
Time: 8.034s
Same code, parallelized with Threads.@threads and run on 4 threads:
using BenchmarkTools
using Base.Threads

function slow(n::Int, digits::String)
    total = Atomic{Float64}(0)
    @threads for i in 1:n
        if !occursin(digits, string(i))
            atomic_add!(total, 1.0 / i)
        end
    end
    println("total = ", total[])
end
@btime slow(Int64(1e8), "9")
Time: 6.938s
Speedup: 1.2
Same loop again, this time with FLoops.jl:
using BenchmarkTools
using FLoops

function slow(n::Int, digits::String)
    total = 0.0
    @floop for i in 1:n
        if !occursin(digits, string(i))
            @reduce(total += 1.0 / i)
        end
    end
    println("total = ", total)
end
@btime slow(Int64(1e8), "9")
Time: 10.850s
No speedup: slower than the sequential code.
I tested the sequential and Threads.@threads code on a different machine and experimented with various numbers of threads. Here are the results:
Number of threads | Speedup |
---|---|
2 | 1.2 |
4 | 1.2 |
8 | 1.0 (no speedup) |
16 | 0.9 (the code takes longer to run than the sequential code) |
For heavier computations (n = 1e9 in the code above), which would minimize the relative effect of any overhead, the results are very similar:
Number of threads | Speedup |
---|---|
2 | 1.1 |
4 | 1.3 |
8 | 1.1 |
16 | 0.8 (the code takes longer to run than the sequential code) |
Code run with Chapel v1.23.0:
use Time;
var watch: Timer;
config const n = 1e8: int;
config const digits = "9";
var total = 0.0;
watch.start();
forall i in 1..n with (+ reduce total) {
  if (i: string).find(digits) == -1 then
    total += 1.0 / i;
}
watch.stop();
writef("total = %{###.###############} in %{##.##} seconds\n",
       total, watch.elapsed());
First run (same hardware as the first Julia tests):
Number of threads | Time (s) | Speedup |
---|---|---|
1 | 13.33 | n/a |
2 | 7.34 | 1.8 |
Second run (same hardware):
Number of threads | Time (s) | Speedup |
---|---|---|
1 | 13.59 | n/a |
2 | 6.83 | 2.0 |
Third run (different hardware):
Number of threads | Time (s) | Speedup |
---|---|---|
1 | 19.99 | n/a |
2 | 10.06 | 2.0 |
4 | 5.05 | 4.0 |
8 | 2.54 | 7.9 |
16 | 1.28 | 15.6 |
Upvotes: 14
Views: 3945
Reputation: 43
As pointed out in a previous answer, I also found that multi-threading performance in Julia is largely influenced by garbage collection.
I used a simple trick: adding GC.gc() before the multi-threaded task to "clean up" the previous garbage. Note: this only works when the memory allocation is not too large.
BTW, you can use GC.enable_logging(true) to get an idea of how long GC takes (it is huge in my code!).
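For instance, a minimal sketch of that trick (the loop body here is just a stand-in for the real multi-threaded task):

using Base.Threads: @threads

GC.enable_logging(true)   # print a line, with timing, for every collection
GC.gc()                   # collect leftover garbage *before* the threaded work

@threads for i in 1:10^7
    occursin("9", string(i))   # stand-in workload; string(i) allocates garbage
end

GC.enable_logging(false)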
Upvotes: 2
Reputation: 2162
As jling suggests in the comments on their answer, the problem here is most likely that the Julia code is allocating lots of memory that needs to be garbage collected. Chapel is, to my understanding, not a garbage-collected language, and that could explain why this example scales more linearly. As a small test of this hypothesis, I benchmarked the following code, which performs the same operations but with a preallocated Vector{UInt8} instead of String:
using BenchmarkTools
using Transducers
using Distributed

# Writes the decimal digits of x into a (most significant first) and returns a.
function string_vector!(a::Vector{UInt8}, x::Unsigned)
    n = ndigits(x)
    length(a) < n && error("Vector too short")
    i = n
    @inbounds while i >= 1
        d, r = divrem(x, 0x0a)   # split off the least significant digit
        a[i] = 0x30 + r          # store its ASCII code
        x = oftype(x, d)
        i -= 1
    end
    a
end

function slow_no_garbage(n::UInt, digits::String)
    digits = collect(codeunits(digits))
    # One preallocated scratch buffer per thread, reused across iterations.
    thread_strings = [zeros(UInt8, 100) for _ in 1:Threads.nthreads()]
    fun(i) = if Base._searchindex(string_vector!(thread_strings[Threads.threadid()], i), digits, 1) == 0
        1.0 / i
    else
        0.0
    end
    total = foldxt(+, Map(fun), 0x1:n)
    "total = $total"
end

println(@btime slow_no_garbage(UInt(1e8), "9"))
I do not recommend using this code (especially since, because the numbers are always growing in length, I don't properly clear the thread buffer between iterations; that is easily fixed, though, as sketched below). However, it results in almost linear scaling with the number of threads (table at the end of the answer).
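For reference, the fix could look like this (my sketch, not part of the original answer; the name string_vector_cleared! is mine): zeroing the buffer before writing guarantees that digits left over from a longer number written earlier can never produce a false match.

# Hypothetical variant of string_vector!: clear stale bytes first. A zero byte
# never equals an ASCII digit (0x30-0x39), so leftovers cannot match the search.
function string_vector_cleared!(a::Vector{UInt8}, x::Unsigned)
    fill!(a, 0x00)   # clear the whole scratch buffer
    n = ndigits(x)
    length(a) < n && error("Vector too short")
    i = n
    @inbounds while i >= 1
        d, r = divrem(x, 0x0a)
        a[i] = 0x30 + r
        x = oftype(x, d)
        i -= 1
    end
    a
end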
As jling also mentioned, if a lot of garbage is created, distribution may work better than threading. The following two code snippets use Transducers.jl to run the code, first using threads:
using BenchmarkTools
using Transducers
function slow_transducers(n::Int, digits::String)
    fun(i) = if !occursin(digits, string(i))
        1.0 / i
    else
        0.0
    end
    total = foldxt(+, Map(fun), 1:n)
    "total = $total"
end
println(@btime slow_transducers(Int64(1e8), "9"))
and then distributed to separate Julia processes (taking the number of processes as the first command-line argument):
using BenchmarkTools
using Transducers
using Distributed
function slow_distributed(n::Int, digits::String)
    fun(i) = if !occursin(digits, string(i))
        1.0 / i
    else
        0.0
    end
    total = foldxd(+, Map(fun), 1:n)
    "total = $total"
end
addprocs(parse(Int, ARGS[1]))
println(@btime slow_distributed(Int64(1e8), "9"))
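Assuming the distributed snippet is saved as slow_distributed.jl (the file name is mine), it would be launched with, e.g., 8 worker processes as:

julia slow_distributed.jl 8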
The following table shows the results of running all versions with different number of threads/processes:
Threads/processes | jling | slow_transducers | slow_distributed | slow_no_garbage | Chapel |
---|---|---|---|---|---|
1 | 4.242 s | 4.224 s | 4.241 s | 2.743 s | 7.32 s |
2 | 2.952 s | 2.958 s | 2.168 s | 1.447 s | 3.73 s |
4 | 2.135 s | 2.147 s | 1.163 s | 0.716105 s | 1.9 s |
8 | 1.742 s | 1.741 s | 0.859058 s | 0.360469 s | |
Speedup:
Threads/processes | jling | slow_transducers | slow_distributed | slow_no_garbage | Chapel |
---|---|---|---|---|---|
1 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
2 | 1.43699 | 1.42799 | 1.95618 | 1.89565 | 1.96247 |
4 | 1.98689 | 1.9674 | 3.6466 | 3.83044 | 3.85263 |
8 | 2.43513 | 2.42619 | 4.9368 | 7.60953 | |
Upvotes: 4
Reputation: 2301
Someone can make a much more detailed analysis than me, but the main reason naive Julia threading performs badly here is that the "task" in each iteration is far too light. Using an atomic, in this case, implies huge overhead, because all the threads end up waiting on it way too often.
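To illustrate the point, here is a minimal sketch of my own (not from the original post): giving each task one big contiguous chunk and a private accumulator removes the contention entirely. The allocations from string(i) still limit scaling, but no thread ever waits on an atomic.

using Base.Threads: @spawn, nthreads

function slow_chunked(n::Int, digits::String)
    chunks = Iterators.partition(1:n, cld(n, nthreads()))
    tasks = map(chunks) do chunk
        @spawn begin
            part = 0.0   # task-local partial sum: no lock, no atomics
            for i in chunk
                if !occursin(digits, string(i))
                    part += 1.0 / i
                end
            end
            part
        end
    end
    "total = $(sum(fetch.(tasks)))"   # one cheap serial reduction at the end
end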
Since your Chapel code is doing a mapreduce, we can also try a parallel mapreduce in Julia:
julia> using BenchmarkTools

julia> function slow(n::Int, digits::String)
           total = 0.0
           for i in 1:n
               if !occursin(digits, string(i))
                   total += 1.0 / i
               end
           end
           "total = $total"
       end
slow (generic function with 1 method)

julia> @btime slow(Int64(1e5), "9")
  6.021 ms (200006 allocations: 9.16 MiB)
"total = 9.692877792106202"

julia> using ThreadsX

julia> function slow_thread_thx(n::Int, digits::String)
           total = ThreadsX.mapreduce(+, 1:n) do i
               if !occursin(digits, string(i))
                   1.0 / i
               else
                   0.0
               end
           end
           "total = $total"
       end
slow_thread_thx (generic function with 1 method)

julia> @btime slow_thread_thx(Int64(1e5), "9")
  1.715 ms (200295 allocations: 9.17 MiB)
"total = 9.692877792106195"
This was with 4 threads; I've tested with other numbers of threads and confirmed that the scaling is pretty linear.
Btw, just as a general tip: try to avoid printing inside benchmarked code, because it makes a mess when the code is timed repeatedly, and if your task is fast, STDIO can take non-negligible time.
Upvotes: 6