Reputation: 35
I've got a function func that takes ~50 s when run on a single core. Now I want to run it many times on a server with 192 CPU cores. But when I increase the number of worker processes to, say, 180, each core slows down: the slowest worker takes ~100 s to compute func.
Can someone help me, please?
Here is the pseudocode:
using Distributed
addprocs(180)
@everywhere include("func.jl") # defines func in every process
First, try using only 10 workers:
# Warm-up pass so compilation time is not included in the timings below
@sync @distributed for i in 1:10
    func()
end
# Timed pass
@sync @distributed for i in 1:10
    @time func()
end
From worker #: 43.537886 seconds (243.58 M allocations: 30.004 GiB, 8.16% gc time)
From worker #: 44.242588 seconds (247.59 M allocations: 30.531 GiB, 7.90% gc time)
From worker #: 44.571170 seconds (246.26 M allocations: 30.338 GiB, 8.81% gc time)
...
From worker #: 45.259822 seconds (252.19 M allocations: 31.108 GiB, 8.25% gc time)
From worker #: 46.746692 seconds (246.36 M allocations: 30.346 GiB, 11.21% gc time)
From worker #: 47.451914 seconds (248.94 M allocations: 30.692 GiB, 8.96% gc time)
This doesn't look bad with 10 workers.
Now use 180 workers:
# Warm-up pass
@sync @distributed for i in 1:180
    func()
end
# Timed pass
@sync @distributed for i in 1:180
    @time func()
end
From worker #: 55.752026 seconds (245.20 M allocations: 30.207 GiB, 9.33% gc time)
From worker #: 57.031739 seconds (245.00 M allocations: 30.176 GiB, 7.70% gc time)
From worker #: 57.552505 seconds (247.76 M allocations: 30.543 GiB, 7.34% gc time)
...
From worker #: 96.850839 seconds (247.33 M allocations: 30.470 GiB, 7.95% gc time)
From worker #: 97.468060 seconds (250.04 M allocations: 30.827 GiB, 6.96% gc time)
From worker #: 98.078816 seconds (250.55 M allocations: 30.883 GiB, 10.87% gc time)
The times increase almost linearly from ~55 s to ~100 s.
I've checked with the top command that CPU usage may not be the bottleneck (the idle percentage "id" stays above 2%), and neither is RAM (only ~20% used).
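For example, a quick sketch to see which logical CPU each worker process is currently scheduled on and how busy it is (the ps flags assume a Linux procps ps):
using Distributed
# ask each worker for its PID, then ask the OS which processor it is on and its CPU usage
for p in workers()
    pid = remotecall_fetch(getpid, p)
    run(`ps -o pid=,psr=,pcpu= -p $pid`)
end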
Other version information:
Julia Version 1.5.3
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) Platinum 9242 CPU @ 2.30GHz
Update: reducing func to a minimal example (a simple for loop) does not change the slowdown. The new pseudocode is:
addprocs(96)
# Minimal example: a tight loop that just accumulates sin(i)
@everywhere function ss()
    sum = 0
    for i in 1:1000000000
        sum += sin(i)
    end
end
@sync @distributed for i in 1:10
    ss()
end
@sync @distributed for i in 1:10
    @time ss()
end
From worker #: 32.8 seconds ..(8 others).. 34.0 seconds
...
@sync @distributed for i in 1:96
    @time ss()
end
From worker #: 38.1 seconds ..(94 others).. 45.4 seconds
Upvotes: 3
Views: 479
Reputation: 42194
You are measuring the time it takes each worker to perform func() and observing a per-process slowdown when going from 10 to 180 parallel processes.
This looks quite normal to me: the Xeon Platinum 9242 has 48 physical cores (96 hardware threads) per socket, so the 192 CPUs your OS reports are very likely 96 physical cores with hyperthreading. Once you run more than ~96 compute-heavy workers, pairs of them end up sharing a physical core and competing for memory bandwidth and cache.
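If you want to confirm the physical-core count, a quick sketch (Sys.CPU_THREADS is in Base; the physical-core call assumes the Hwloc.jl package is installed):
using Hwloc

Sys.CPU_THREADS              # logical CPUs the OS exposes, hyperthreads included
Hwloc.num_physical_cores()   # physical cores; if this is 96, the other 96 are hyperthreads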
Looking at this data, I would recommend that you try:
addprocs(192 ÷ 2)
and see how the performance changes.
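For instance, a minimal sketch of that re-run, reusing the func.jl setup from your question:
using Distributed
addprocs(192 ÷ 2)                  # 96 workers, one per physical core
@everywhere include("func.jl")     # defines func on every worker

@sync @distributed for i in 1:96   # warm-up pass (compiles func on the workers)
    func()
end
@sync @distributed for i in 1:96   # timed pass
    @time func()
end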
Upvotes: 1