Reputation: 7347
I know that questions about multi-threading performance in Julia have already been asked (e.g. here), but they involve fairly complex code in which many things could be at play.
Here, I am running a very simple loop on multiple threads using Julia v1.5.3, and the speedup doesn't scale very well compared to running the same loop in, for instance, Chapel.
I would like to know what I am doing wrong and how I could run multi-threading in Julia more efficiently.
using BenchmarkTools

function slow(n::Int, digits::String)
    total = 0.0
    for i in 1:n
        if !occursin(digits, string(i))
            total += 1.0 / i
        end
    end
    println("total = ", total)
end
@btime slow(Int64(1e8), "9")
Time: 8.034s
Same code, parallelized with Threads.@threads and run on 4 threads:
using BenchmarkTools
using Base.Threads

function slow(n::Int, digits::String)
    total = Atomic{Float64}(0)
    @threads for i in 1:n
        if !occursin(digits, string(i))
            atomic_add!(total, 1.0 / i)
        end
    end
    println("total = ", total[])
end
@btime slow(Int64(1e8), "9")
Time: 6.938s
Speedup: 1.2
Same loop again, this time with FLoops.jl:
using BenchmarkTools
using FLoops

function slow(n::Int, digits::String)
    total = 0.0
    @floop for i in 1:n
        if !occursin(digits, string(i))
            @reduce(total += 1.0 / i)
        end
    end
    println("total = ", total)
end
@btime slow(Int64(1e8), "9")
Time: 10.850s
No speedup: slower than the sequential code.
I tested the sequential and Threads.@threads code on a different machine and experimented with various numbers of threads. Here are the results:
Number of threads | Speedup |
---|---|
2 | 1.2 |
4 | 1.2 |
8 | 1.0 (no speedup) |
16 | 0.9 (the code takes longer to run than the sequential code) |
For heavier computations (n = 1e9 in the code above), which would minimize the relative effect of any overhead, the results are very similar:
Number of threads | Speedup |
---|---|
2 | 1.1 |
4 | 1.3 |
8 | 1.1 |
16 | 0.8 (the code takes longer to run than the sequential code) |
Code run with Chapel v1.23.0:
use Time;
var watch: Timer;
config const n = 1e8: int;
config const digits = "9";
var total = 0.0;
watch.start();
forall i in 1..n with (+ reduce total) {
  if (i: string).find(digits) == -1 then
    total += 1.0 / i;
}
watch.stop();
writef("total = %{###.###############} in %{##.##} seconds\n",
       total, watch.elapsed());
First run (same hardware as the first Julia tests):
Number of threads | Time (s) | Speedup |
---|---|---|
1 | 13.33 | n/a |
2 | 7.34 | 1.8 |
Second run (same hardware):
Number of threads | Time (s) | Speedup |
---|---|---|
1 | 13.59 | n/a |
2 | 6.83 | 2.0 |
Third run (different hardware):
Number of threads | Time (s) | Speedup |
---|---|---|
1 | 19.99 | n/a |
2 | 10.06 | 2.0 |
4 | 5.05 | 4.0 |
8 | 2.54 | 7.9 |
16 | 1.28 | 15.6 |
Upvotes: 14
Views: 3945
Reputation: 43
As pointed out in a previous answer, I also found that multi-threading performance in Julia is largely influenced by garbage collection.
I used a simple trick: adding GC.gc() before the multi-threaded task to "clean up" the previous garbage. Note: this only works when the memory allocation is not too large.
BTW, you can use GC.enable_logging(true) to get an idea of how long GC takes (it is huge in my code!).
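For instance, a minimal sketch of that trick (the loop body here is just a stand-in for the real multi-threaded task):

using Base.Threads: @threads

GC.enable_logging(true)   # print a line, with timing, for every collection
GC.gc()                   # collect leftover garbage *before* the threaded work

@threads for i in 1:10^7
    occursin("9", string(i))   # stand-in workload; string(i) allocates garbage
end

GC.enable_logging(false)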
Upvotes: 2
Reputation: 2162
As jling suggests in the comments on their answer, the problem here is most likely that the Julia code is allocating lots of memory that needs to be garbage collected. Chapel is, to my understanding, not a garbage-collected language, and that could explain why this example scales more linearly. As a small test of this hypothesis, I benchmarked the following code, which performs the same operations but with a preallocated Vector{UInt8} instead of String:
using BenchmarkTools
using Transducers
using Distributed

# Writes the decimal digits of x into a (most significant first) and returns a.
function string_vector!(a::Vector{UInt8}, x::Unsigned)
    n = ndigits(x)
    length(a) < n && error("Vector too short")
    i = n
    @inbounds while i >= 1
        d, r = divrem(x, 0x0a)   # split off the least significant digit
        a[i] = 0x30 + r          # store its ASCII code
        x = oftype(x, d)
        i -= 1
    end
    a
end

function slow_no_garbage(n::UInt, digits::String)
    digits = collect(codeunits(digits))
    # One preallocated scratch buffer per thread, reused across iterations.
    thread_strings = [zeros(UInt8, 100) for _ in 1:Threads.nthreads()]
    fun(i) = if Base._searchindex(string_vector!(thread_strings[Threads.threadid()], i), digits, 1) == 0
        1.0 / i
    else
        0.0
    end
    total = foldxt(+, Map(fun), 0x1:n)
    "total = $total"
end

println(@btime slow_no_garbage(UInt(1e8), "9"))
I do not recommend using this code (especially since, because the numbers are always growing in length, I don't properly clear the thread buffer between iterations; that is easily fixed, though, as sketched below). However, it results in almost linear scaling with the number of threads (table at the end of the answer).
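For reference, the fix could look like this (my sketch, not part of the original answer; the name string_vector_cleared! is mine): zeroing the buffer before writing guarantees that digits left over from a longer number written earlier can never produce a false match.

# Hypothetical variant of string_vector!: clear stale bytes first. A zero byte
# never equals an ASCII digit (0x30-0x39), so leftovers cannot match the search.
function string_vector_cleared!(a::Vector{UInt8}, x::Unsigned)
    fill!(a, 0x00)   # clear the whole scratch buffer
    n = ndigits(x)
    length(a) < n && error("Vector too short")
    i = n
    @inbounds while i >= 1
        d, r = divrem(x, 0x0a)
        a[i] = 0x30 + r
        x = oftype(x, d)
        i -= 1
    end
    a
end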
As jling also mentioned, if a lot of garbage is created, distribution may work better than threading. The following two code snippets use Transducers.jl to run the code, first using threads:
using BenchmarkTools
using Transducers
function slow_transducers(n::Int, digits::String)
    fun(i) = if !occursin(digits, string(i))
        1.0 / i
    else
        0.0
    end
    total = foldxt(+, Map(fun), 1:n)
    "total = $total"
end
println(@btime slow_transducers(Int64(1e8), "9"))
and then distributed to separate Julia processes (taking the number of processes as the first command-line argument):
using BenchmarkTools
using Transducers
using Distributed
function slow_distributed(n::Int, digits::String)
    fun(i) = if !occursin(digits, string(i))
        1.0 / i
    else
        0.0
    end
    total = foldxd(+, Map(fun), 1:n)
    "total = $total"
end
addprocs(parse(Int, ARGS[1]))
println(@btime slow_distributed(Int64(1e8), "9"))
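Assuming the distributed snippet is saved as slow_distributed.jl (the file name is mine), it would be launched with, e.g., 8 worker processes as:

julia slow_distributed.jl 8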
The following table shows the results of running all versions with different number of threads/processes:
Threads/processes | jling | slow_transducers | slow_distributed | slow_no_garbage | Chapel |
---|---|---|---|---|---|
1 | 4.242 s | 4.224 s | 4.241 s | 2.743 s | 7.32 s |
2 | 2.952 s | 2.958 s | 2.168 s | 1.447 s | 3.73 s |
4 | 2.135 s | 2.147 s | 1.163 s | 0.716105 s | 1.9 s |
8 | 1.742 s | 1.741 s | 0.859058 s | 0.360469 s | |
Speedup:
Threads/processes | jling | slow_transducers | slow_distributed | slow_no_garbage | Chapel |
---|---|---|---|---|---|
1 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
2 | 1.43699 | 1.42799 | 1.95618 | 1.89565 | 1.96247 |
4 | 1.98689 | 1.9674 | 3.6466 | 3.83044 | 3.85263 |
8 | 2.43513 | 2.42619 | 4.9368 | 7.60953 | |
Upvotes: 4
Reputation: 2301
Someone can make a much more detailed analysis than me, but the main reason naive Julia threading performs badly here is that the "task" in each iteration is far too light. Using an atomic, in this case, implies huge overhead, because all the threads end up waiting on it way too often.
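To illustrate the point, here is a minimal sketch of my own (not from the original post): giving each task one big contiguous chunk and a private accumulator removes the contention entirely. The allocations from string(i) still limit scaling, but no thread ever waits on an atomic.

using Base.Threads: @spawn, nthreads

function slow_chunked(n::Int, digits::String)
    chunks = Iterators.partition(1:n, cld(n, nthreads()))
    tasks = map(chunks) do chunk
        @spawn begin
            part = 0.0   # task-local partial sum: no lock, no atomics
            for i in chunk
                if !occursin(digits, string(i))
                    part += 1.0 / i
                end
            end
            part
        end
    end
    "total = $(sum(fetch.(tasks)))"   # one cheap serial reduction at the end
end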
Since your Chapel code is doing a mapreduce, we can also try a parallel mapreduce in Julia:
julia> using BenchmarkTools

julia> function slow(n::Int, digits::String)
           total = 0.0
           for i in 1:n
               if !occursin(digits, string(i))
                   total += 1.0 / i
               end
           end
           "total = $total"
       end
slow (generic function with 1 method)

julia> @btime slow(Int64(1e5), "9")
  6.021 ms (200006 allocations: 9.16 MiB)
"total = 9.692877792106202"

julia> using ThreadsX

julia> function slow_thread_thx(n::Int, digits::String)
           total = ThreadsX.mapreduce(+, 1:n) do i
               if !occursin(digits, string(i))
                   1.0 / i
               else
                   0.0
               end
           end
           "total = $total"
       end
slow_thread_thx (generic function with 1 method)

julia> @btime slow_thread_thx(Int64(1e5), "9")
  1.715 ms (200295 allocations: 9.17 MiB)
"total = 9.692877792106195"
This was with 4 threads; I've tested with other numbers of threads and confirmed that the scaling is pretty linear.
Btw, just as a general tip: try to avoid printing inside benchmarked code, because it makes a mess when the code is timed repeatedly, and if your task is fast, STDIO can take non-negligible time.
Upvotes: 6