Reputation: 11
I wrote a minimal example trying to parallelize some computation in Haskell. I basically tried to map some trig functions onto a long list by splitting the list in half and evaluating the two halves in parallel with par and pseq.
My code:
-- in file par.hs
module Par where

import Control.Parallel
import Data.List

parmap f l = let
    halflen = floor $ (realToFrac $ length l) / 2
    half1   = take halflen l
    half2   = drop halflen l
    mapped1 = map f half1
    mapped2 = map f half2
  in mapped1 `par` (mapped2 `pseq` mapped1 ++ mapped2)

forceEval x = sin x + cos x - tan x + sin(3*x) - cos(2*x)

runMap :: Double
runMap = foldl' (+) 0.0 $ parmap forceEval [0.0..2000000.0]

main = do
  putStrLn $ show runMap
I compiled it with ghc -prof -fprof-auto -rtsopts -threaded par.hs
I ran it with ./par +RTS -p -N? where ? is the number of processors.
I then looked at the generated par.prof file.
Below are a couple of profiling outputs I got. I ran and profiled multiple times, so I'm pretty sure these numbers were not outliers.
Running with -N1 gave me:
Tue May 7 23:20 2019 Time and Allocation Profiling Report (Final)
par +RTS -N1 -p -RTS
total time = 1.20 secs (1200 ticks @ 1000 us, 1 processor)
total alloc = 1,936,132,144 bytes (excludes profiling overheads)
COST CENTRE MODULE SRC %time %alloc
forceEval Main par.hs:28:1-57 75.8 60.3
runMap Main par.hs:31:1-59 17.5 19.0
parmap' Main par.hs:(6,1)-(12,56) 3.2 14.9
parmap'.half1 Main par.hs:8:5-26 2.3 5.8
parmap'.half2 Main par.hs:9:5-26 1.1 0.0
Running with -N2 (note how the profile indicates more than a 2x speedup):
Tue May 7 23:24 2019 Time and Allocation Profiling Report (Final)
par +RTS -N2 -p -RTS
total time = 0.36 secs (725 ticks @ 1000 us, 2 processors)
total alloc = 1,936,149,368 bytes (excludes profiling overheads)
COST CENTRE MODULE SRC %time %alloc
forceEval Main par.hs:28:1-57 70.6 60.3
runMap Main par.hs:31:1-59 19.3 19.0
parmap' Main par.hs:(6,1)-(12,56) 4.3 14.9
parmap'.half1 Main par.hs:8:5-26 3.3 5.8
parmap'.half2 Main par.hs:9:5-26 1.7 0.0
Running with -N4 (note the slight further speedup compared to -N2):
Tue May 7 23:25 2019 Time and Allocation Profiling Report (Final)
par +RTS -N4 -p -RTS
total time = 0.27 secs (1098 ticks @ 1000 us, 4 processors)
total alloc = 1,936,183,704 bytes (excludes profiling overheads)
COST CENTRE MODULE SRC %time %alloc
forceEval Main par.hs:28:1-57 71.5 60.3
runMap Main par.hs:31:1-59 19.3 19.0
parmap' Main par.hs:(6,1)-(12,56) 3.8 14.9
parmap'.half1 Main par.hs:8:5-26 3.6 5.8
parmap'.half2 Main par.hs:9:5-26 1.2 0.0
I expected to see some speedup when running on two processors, and no additional speedup on more processors, since the work is only split into two parts. In practice I couldn't observe any wall-clock speedup at all, yet the GHC profiles above report one: an unrealistically good one, plus further speedup when using more than two processors.
In fact, in another program where I attempted to parallelize computation, compiling and profiling with stack, I observed a similar false speedup.
I'd really appreciate it if someone could explain what is going on: Is this the right way of writing parallel Haskell code? Why would running with 4 cores help the runtime if I only had two parts to evaluate in parallel? Or did I just misinterpret the profiling output?
Thanks in advance for any help.
Upvotes: 1
Views: 81
Reputation: 20436
When par sparks the evaluation of its first argument, it only forces it to weak head normal form (the top-level constructor); the result can still refer to unevaluated lazy computation. For a list such as mapped1, WHNF is just the first cons cell, so almost none of the per-element work is actually done in parallel.
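As a small standalone sketch of that (illustrative only, not the code from the question): the spark evaluates the list only to its outermost (:) cell, so the expensive per-element work still happens sequentially when sum consumes the list.

import Control.Parallel (par)

-- Sketch: `par` only sparks evaluation to weak head normal form.
-- For a list that is just the outermost (:) constructor, so the elements
-- remain unevaluated thunks until `sum` forces them on the main thread.
main :: IO ()
main = do
  let xs = map (\x -> sin x + cos x) [0.0 .. 1000000.0] :: [Double]
  xs `par` print (sum xs)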
If you change it to
import Control.Parallel
import Data.List

parmap f a l = let
    halflen = floor $ (realToFrac $ length l) / 2
    half1   = take halflen l
    half2   = drop halflen l
    mapped1 = map f half1
    mapped2 = map f half2
    agg1    = a mapped1
    agg2    = a mapped2
  in agg1 `par` agg2 `pseq` a [agg1, agg2]

forceEval x = sin x + cos x - tan x + sin(3*x) - cos(2*x)

runMap :: Double
runMap = parmap forceEval (foldl' (+) 0.0) [0.0..20000000.0]

main = do
  putStrLn $ show runMap
you will see a speedup when switching from one thread
time ./f +RTS -N1
-4615093.834471449
real 0m13.077s
user 0m12.333s
sys 0m0.744s
to two threads
time ./f +RTS -N2
-4615093.834471449
real 0m9.057s
user 0m14.512s
sys 0m2.170s
In my example agg1 and agg2 are plain Double values, so their weak head normal form is the fully evaluated number, which makes it easy to force the whole computation for each half.
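A stripped-down sketch of that idea, assuming we only want the two half-sums: because each half-sum is a Double, sparking it really does push the entire fold for that half onto another capability.

import Control.Parallel (par, pseq)
import Data.List (foldl')

-- Sketch only: each half-sum is a plain Double, so forcing it to WHNF
-- evaluates the entire fold for that half, which is real parallel work.
main :: IO ()
main = do
  let s1 = foldl' (+) 0.0 [0.0 .. 10000000.0]        :: Double
      s2 = foldl' (+) 0.0 [10000001.0 .. 20000000.0] :: Double
  s1 `par` (s2 `pseq` print (s1 + s2))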
Alternative
You can use rnf from Control.DeepSeq to force the whole list to be evaluated.
import Control.Parallel
import Control.DeepSeq
import Data.List

parmap f l = let
    halflen = floor $ (realToFrac $ length l) / 2
    half1   = take halflen l
    half2   = drop halflen l
    mapped1 = map f half1
    mapped2 = map f half2
  in (rnf mapped1) `par` (rnf mapped2) `pseq` (mapped1 ++ mapped2)

forceEval x = sin x + cos x - tan x + sin(3*x) - cos(2*x)

runMap :: Double
runMap = foldl' (+) 0.0 $ parmap forceEval [0.0..20000000.0]

main = do
  putStrLn $ show runMap
This also gives a performance improvement.
One thread
time ./f2 +RTS -N1
-4615093.83447202
real 0m15.241s
user 0m14.261s
sys 0m0.980s
Two threads
time ./f2 +RTS -N2
-4615093.83447202
real 0m11.640s
user 0m17.092s
sys 0m3.178s
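As a further sketch (not measured here): Control.Parallel.Strategies can do the chunking for you, and splitting the mapped list into many chunks instead of exactly two halves gives the runtime enough sparks to keep more than two capabilities busy. The chunk size of 100000 below is just an illustrative choice.

import Control.Parallel.Strategies (using, parListChunk, rdeepseq)
import Data.List (foldl')

forceEval :: Double -> Double
forceEval x = sin x + cos x - tan x + sin(3*x) - cos(2*x)

runMap :: Double
runMap = foldl' (+) 0.0 mapped
  where
    -- Evaluate the mapped list in chunks of 100000 elements, each chunk
    -- sparked and fully forced (rdeepseq), so any number of capabilities
    -- can pick up work.
    mapped = map forceEval [0.0 .. 20000000.0]
               `using` parListChunk 100000 rdeepseq

main :: IO ()
main = putStrLn (show runMap)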
Upvotes: 1