Reputation: 283
The following code has the same performance whether compiled with -threaded or without, or when I write the code in a single-threaded manner. Both blocks (the one using par and the commented-out forkIO/forkOS/forkOn one) result in the same performance. In fact, performance is slightly degraded in the parallel version (presumably due to the overhead of parallel GC). Viewing CPU utilization in a program like htop shows only one CPU getting pegged, which is pretty confusing since my reading of the code is that it should use most of the cores.

The fact that forkOS doesn't use more cores is particularly confusing, since the relevant section of ghc/rts/posix/OSThreads.c:forkOS_createThread seems to imply that it forces a call to pthread_create.
{-# LANGUAGE ScopedTypeVariables #-}
-- (Apologies if I have missed an import or two)

import Data.List
import Data.Monoid ((<>))
import GHC.Conc
import Control.Monad (forM, void)
import Control.Concurrent
import Control.DeepSeq
import qualified Data.HashMap.Lazy as HM
main :: IO ()
main = do
  let [one::Int, two] = [15, 1000000]
  {-
  s <- numSparks
  putStrLn $ "Num sparks " <> show s
  n <- getNumCapabilities
  putStrLn $ "Num capabilities " <> show n
  m <- newEmptyMVar
  forkIO $ void $ forM [(1::Int)..one] $ \cpu -> do
    -- forkOn cpu $ void $ do
    forkOS $ void $ do
    -- forkIO $ void $ do
    -- void $ do
      putStrLn $ "core " <> show cpu
      s <- return $ sort $ HM.keys $ HM.fromList $ zip [cpu..two + cpu] (repeat (0::Int))
      putStrLn $ "core " <> show cpu <> " done " <> show (sum s)
      putMVar m ()
  forM [1..one] $ \i -> takeMVar m
  let s :: String = "hey!"
  putStrLn s
  -}
  print one
  print two
  let __pmap__ f xs = case xs of
        [] -> []
        x:xs -> let y  = f x
                    ys = __pmap__ f xs
                in (y `par` ys) `pseq` (y : ys)
  n <- pure $ sum . concat $ flip __pmap__ [1..one] $ \i ->
    force $ sort $ HM.keys $ HM.fromList $ zip [i..(two + i)] (repeat (0::Int))
  putStrLn $ "sum " <> show n
  s <- numSparks
  putStrLn $ "Num sparks " <> show s
Relevant section from my .cabal file:
ghc-options:
  -threaded
  -rtsopts
  "-with-rtsopts=-N15 -qg1"
Platform information
$ stack --version
Version 1.2.0, Git revision 241cd07d576d9c0c0e712e83d947e3dd64541c42 (4054 commits) x86_64 hpack-0.14.0
$ stack exec ghc -- --version
The Glorious Glasgow Haskell Compilation System, version 7.10.3
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 16.04.1 LTS
Release: 16.04
Codename: xenial
$ uname -r
4.4.0-36-generic
Why isn't my code getting parallelized?
EDIT: If it's helpful at all, adding the -s runtime flag produces the following report:
      21,829,377,776 bytes allocated in the heap
     126,512,021,712 bytes copied during GC
          86,659,312 bytes maximum residency (322 sample(s))
           6,958,976 bytes maximum slop
                 218 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     41944 colls,     0 par   16.268s  17.272s     0.0004s    0.0011s
  Gen  1       322 colls,   321 par  237.056s  23.822s     0.0740s    0.2514s

  Parallel GC work balance: 13.01% (serial 0%, perfect 100%)

  TASKS: 32 (1 bound, 31 peak workers (31 total), using -N15)

  SPARKS: 15 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 15 fizzled)

  INIT    time    0.004s  (  0.003s elapsed)
  MUT     time   12.504s  ( 13.301s elapsed)
  GC      time  253.324s  ( 41.094s elapsed)
  EXIT    time    0.000s  (  0.017s elapsed)
  Total   time  265.920s  ( 54.413s elapsed)

  Alloc rate    1,745,791,568 bytes per MUT second

  Productivity   4.7% of total user, 23.1% of total elapsed

gc_alloc_block_sync: 10725286
whitehole_spin: 0
gen[0].sync: 2171
gen[1].sync: 1057315
EDIT2: Messing with the allocation area size seems to have helped considerably. I added -H2G -A1G to the RTS options and the time came down from 43s to 5.2s. Is there anything else that can be improved about the situation to get a full 15x speedup?
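For reference, one way to fold those flags into the ghc-options shown above would be something like the following (just a sketch of my setup, not necessarily the best tuning; since -rtsopts is on, they could equally be passed at run time between +RTS and -RTS):

ghc-options:
  -threaded
  -rtsopts
  "-with-rtsopts=-N15 -qg1 -H2G -A1G"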
EDIT3: Edited the code to reflect the par/pseq pattern suggested by two people giving feedback.
Upvotes: 3
Views: 655
Reputation: 23850
The issue is caused by the definition of __pmap__. Specifically, there is a problem in the following expression:
let y = f x
in y `par` (y: __pmap__ f xs)
You would expect that this would cause y and y: __pmap__ f xs to be evaluated in parallel, but this is not the case. What happens is that GHC tries to evaluate them in parallel, but the second subexpression contains y, which is the first subexpression. Because of that, the second subexpression depends on the first one and thus they cannot be evaluated in parallel. The correct way to write the above expression is
let y  = f x
    ys = __pmap__ f xs
in y `par` (ys `pseq` (y : ys))
because the pseq will force ys to be evaluated before y : ys, and thus the evaluation of the second subexpression can be started while the evaluation of y is running. See also this thread for some discussion on this.
So putting it all together, we get the following:
main :: IO ()
main = do
  let [one::Int, two] = [15, 1000000]
  print one
  print two
  let __pmap__ f xs = case xs of
        [] -> []
        x:xs -> let y  = f x
                    ys = __pmap__ f xs
                in y `par` ys `pseq` (y : ys)
  n <- pure $ sum . concat $ flip __pmap__ [1..one] $ \i ->
    traceShow i $ force $ sort $ HM.keys $ HM.fromList $ zip [i..(two + i)] (repeat (0::Int))
  putStrLn $ "sum " <> show n
  s <- numSparks
  putStrLn $ "Num sparks " <> show s
Notice that I've added a traceShow (from Debug.Trace). If you run this with -N1 in rtsopts, you will see that the list is evaluated one element at a time, whereas if you use -N3, it will be evaluated 3 elements at a time.
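For example, the two runs would look roughly like this (partest is just a placeholder name for the built executable, and this assumes the baked-in -with-rtsopts=-N15 is removed or overridden by the command-line flag):

$ ./partest +RTS -N1 -s -RTS
$ ./partest +RTS -N3 -s -RTS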
The moral of the story is that par and pseq are easy to misuse, and you should therefore prefer higher-level solutions such as parMap rdeepseq (which is equivalent to your __pmap__) from the parallel package.
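For illustration, a minimal sketch of what that might look like with parMap rdeepseq from Control.Parallel.Strategies (the standalone main and the hard-coded one/two values are just for this example):

import Control.Parallel.Strategies (parMap, rdeepseq)
import Data.List (sort)
import qualified Data.HashMap.Lazy as HM

main :: IO ()
main = do
  let one = 15      :: Int
      two = 1000000 :: Int
  -- parMap rdeepseq sparks one fully-forced result per list element,
  -- playing the role of the hand-written __pmap__ plus force.
  let chunks = parMap rdeepseq
                 (\i -> sort $ HM.keys $ HM.fromList $ zip [i .. two + i] (repeat (0 :: Int)))
                 [1 .. one]
  putStrLn $ "sum " ++ show (sum (concat chunks))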
Upvotes: 1