Charles Cooper

Reputation: 283

Haskell: parallel program not utilizing all cores

The following code has the same performance whether compiled with -threaded or without, and the same holds when I write it in a single-threaded manner. Both approaches (the par version and the commented-out forkIO/forkOS/forkOn version) perform identically. In fact, performance is slightly degraded in the parallel version, presumably due to the overhead of parallel GC. Watching CPU utilization in a program like htop shows only one CPU getting pegged, which is confusing since my reading of the code is that it should use most of the cores.

The fact that forkOS doesn't use more cores is particularly confusing, since the relevant section of ghc/rts/posix/OSThreads.c (forkOS_createThread) seems to imply that it forces a call to pthread_create.
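As a sanity check, a standalone snippet like the following can confirm whether a binary was actually linked with -threaded and how many capabilities the RTS sees (rtsSupportsBoundThreads and getNumCapabilities are both exported from Control.Concurrent):

import Control.Concurrent (getNumCapabilities, rtsSupportsBoundThreads)

main :: IO ()
main = do
  -- False here means the binary was linked without -threaded;
  -- in that case forkOS fails at runtime and sparks never run in parallel
  putStrLn $ "Threaded RTS: " ++ show rtsSupportsBoundThreads
  putStrLn . ("Capabilities: " ++) . show =<< getNumCapabilities

The full program: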

{-# LANGUAGE ScopedTypeVariables #-}

import Data.List
import Data.Monoid ((<>))
import Control.Monad (forM, void)
import GHC.Conc
import Control.Concurrent
import Control.DeepSeq
import qualified Data.HashMap.Lazy as HM

main :: IO ()
main = do
  let [one::Int, two] = [15, 1000000]
{-
  s <- numSparks
  putStrLn $ "Num sparks " <> show s
  n <- getNumCapabilities
  putStrLn $ "Num capabilities " <> show n
  m <- newEmptyMVar
  forkIO $ void $ forM [(1::Int)..one] $ \cpu -> do
    -- forkOn cpu $ void $ do
    forkOS $ void $ do
    -- forkIO $ void $ do
    -- void $ do
      putStrLn $ "core " <> show cpu
      s <- return $ sort $ HM.keys $ HM.fromList $ zip [cpu..two + cpu] (repeat (0::Int))
      putStrLn $ "core " <> show cpu <> " done " <> show (sum s)
      putMVar m ()
  forM [1..one] $ \i -> takeMVar m
  let s :: String = "hey!"
  putStrLn s
-}
  print one
  print two
  let __pmap__ f xs = case xs of
        [] -> []
        x:xs -> let y = f x
                    ys = __pmap__ f xs
                in (y `par` ys) `pseq` (y : ys)
  n <- pure $ sum . concat $ flip __pmap__ [1..one] $ \i ->
    force $ sort $ HM.keys $ HM.fromList $ zip [i..(two + i)] (repeat (0::Int))
  putStrLn $ "sum " <> show n
  s <- numSparks
  putStrLn $ "Num sparks " <> show s

Relevant section from my .cabal file

  ghc-options:
    -threaded
    -rtsopts
    "-with-rtsopts=-N15 -qg1"

Platform information

$ stack --version
Version 1.2.0, Git revision 241cd07d576d9c0c0e712e83d947e3dd64541c42 (4054 commits) x86_64 hpack-0.14.0
$ stack exec ghc -- --version
The Glorious Glasgow Haskell Compilation System, version 7.10.3
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 16.04.1 LTS
Release:    16.04
Codename:   xenial
$ uname -r
4.4.0-36-generic

Why isn't my code getting parallelized?

EDIT: In case it's helpful, running with the -s RTS flag produces the following report:

  21,829,377,776 bytes allocated in the heap
 126,512,021,712 bytes copied during GC
      86,659,312 bytes maximum residency (322 sample(s))
       6,958,976 bytes maximum slop
             218 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     41944 colls,     0 par   16.268s  17.272s     0.0004s    0.0011s
  Gen  1       322 colls,   321 par   237.056s  23.822s     0.0740s    0.2514s

  Parallel GC work balance: 13.01% (serial 0%, perfect 100%)

  TASKS: 32 (1 bound, 31 peak workers (31 total), using -N15)

  SPARKS: 15 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 15 fizzled)

  INIT    time    0.004s  (  0.003s elapsed)
  MUT     time   12.504s  ( 13.301s elapsed)
  GC      time  253.324s  ( 41.094s elapsed)
  EXIT    time    0.000s  (  0.017s elapsed)
  Total   time  265.920s  ( 54.413s elapsed)

  Alloc rate    1,745,791,568 bytes per MUT second

  Productivity   4.7% of total user, 23.1% of total elapsed

gc_alloc_block_sync: 10725286
whitehole_spin: 0
gen[0].sync: 2171
gen[1].sync: 1057315

EDIT2: Tuning the allocation area size helped considerably: adding -H2G -A1G to the RTS options brought the time down from 43s to 5.2s. Is there anything else that can be improved to get a full 15x speedup?
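For reference, with these flags baked in via -with-rtsopts, the ghc-options section now looks like this (they could equally be passed on the command line at run time, since -rtsopts is enabled):

  ghc-options:
    -threaded
    -rtsopts
    "-with-rtsopts=-N15 -qg1 -H2G -A1G"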

EDIT3: Edited the code above to reflect the par/pseq pattern suggested in the feedback.

Upvotes: 3

Views: 655

Answers (1)

redneb

Reputation: 23850

The issue is caused by the definition of __pmap__. Specifically, the problem lies in the following expression:

let y = f x
 in y `par` (y: __pmap__ f xs)

You would expect this to cause y and y: __pmap__ f xs to be evaluated in parallel, but that is not what happens. GHC does try to evaluate them in parallel, but the second subexpression contains y, which is the first subexpression; since the second depends on the first, the two cannot actually be evaluated in parallel. The correct way to write the above expression is

let y = f x
    ys = __pmap__ f xs
 in y `par` (ys `pseq` (y : ys))

because the pseq forces ys to be evaluated before y : ys, so the evaluation of the second subexpression can start while y is still being evaluated. See also this thread for some discussion on this.

So putting it all together, we get the following:

-- imports as in the question, plus Debug.Trace (traceShow)
main :: IO ()
main = do
  let [one::Int, two] = [15, 1000000]
  print one
  print two
  let __pmap__ f xs = case xs of
        [] -> []
        x:xs -> let y = f x
                    ys = __pmap__ f xs
                 in y `par` ys `pseq` (y : ys)
  n <- pure $ sum . concat $ flip __pmap__ [1..one] $ \i ->
    traceShow i $ force $ sort $ HM.keys $ HM.fromList $ zip [i..(two + i)] (repeat (0::Int))
  putStrLn $ "sum " <> show n
  s <- numSparks
  putStrLn $ "Num sparks " <> show s

Notice that I've added a traceShow (from Debug.Trace). If you run this with -N1 in rtsopts, you will see the list being evaluated one element at a time, whereas with -N3 it will be evaluated 3 elements at a time.

The moral of the story is that par and pseq are easy to misuse, so you should prefer higher-level solutions such as parMap rdeepseq (which is equivalent to your __pmap__) from the parallel package.
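For illustration, here is a minimal self-contained sketch of the question's computation using parMap rdeepseq from Control.Parallel.Strategies (the binding name chunks is mine; otherwise the structure mirrors the question's code):

import Data.List (sort)
import Control.Parallel.Strategies (parMap, rdeepseq) -- from the parallel package
import qualified Data.HashMap.Lazy as HM

main :: IO ()
main = do
  let one = 15 :: Int
      two = 1000000 :: Int
  -- parMap rdeepseq sparks one fully forced element per list item,
  -- replacing both the hand-rolled __pmap__ and the explicit force
  let chunks = parMap rdeepseq
        (\i -> sort $ HM.keys $ HM.fromList $ zip [i .. two + i] (repeat (0 :: Int)))
        [1 .. one]
  putStrLn $ "sum " ++ show (sum (concat chunks))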

Upvotes: 1
