Charles Cooper

Reputation: 283

Haskell: parallel program not utilizing all cores

The following code has the same performance whether compiled with -threaded or without, and the same holds when I write it in a single-threaded manner. Both approaches (the par version and the commented-out forkIO/forkOS/forkOn version) perform identically. In fact, performance is slightly degraded in the parallel version, presumably due to the overhead of parallel GC. Watching CPU utilization in a program like htop shows only one CPU getting pegged, which is confusing since my reading of the code is that it should use most of the cores.

The fact that forkOS doesn't use more cores is particularly confusing, since the relevant section of ghc/rts/posix/OSThreads.c (forkOS_createThread) seems to imply that it forces a call to pthread_create.
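As a sanity check, a standalone snippet like the following can confirm whether a binary was actually linked with -threaded and how many capabilities the RTS sees (rtsSupportsBoundThreads and getNumCapabilities are both exported from Control.Concurrent):

import Control.Concurrent (getNumCapabilities, rtsSupportsBoundThreads)

main :: IO ()
main = do
  -- False here means the binary was linked without -threaded;
  -- in that case forkOS fails at runtime and sparks never run in parallel
  putStrLn $ "Threaded RTS: " ++ show rtsSupportsBoundThreads
  putStrLn . ("Capabilities: " ++) . show =<< getNumCapabilities

The full program: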

{-# LANGUAGE ScopedTypeVariables #-}

import Data.List
import Data.Monoid ((<>))
import Control.Monad (forM, void)
import GHC.Conc
import Control.Concurrent
import Control.DeepSeq
import qualified Data.HashMap.Lazy as HM

main :: IO ()
main = do
  let [one::Int, two] = [15, 1000000]
{-
  s <- numSparks
  putStrLn $ "Num sparks " <> show s
  n <- getNumCapabilities
  putStrLn $ "Num capabilities " <> show n
  m <- newEmptyMVar
  forkIO $ void $ forM [(1::Int)..one] $ \cpu -> do
    -- forkOn cpu $ void $ do
    forkOS $ void $ do
    -- forkIO $ void $ do
    -- void $ do
      putStrLn $ "core " <> show cpu
      s <- return $ sort $ HM.keys $ HM.fromList $ zip [cpu..two + cpu] (repeat (0::Int))
      putStrLn $ "core " <> show cpu <> " done " <> show (sum s)
      putMVar m ()
  forM [1..one] $ \i -> takeMVar m
  let s :: String = "hey!"
  putStrLn s
-}
  print one
  print two
  let __pmap__ f xs = case xs of
        [] -> []
        x:xs -> let y = f x
                    ys = __pmap__ f xs
                in (y `par` ys) `pseq` (y : ys)
  n <- pure $ sum . concat $ flip __pmap__ [1..one] $ \i ->
    force $ sort $ HM.keys $ HM.fromList $ zip [i..(two + i)] (repeat (0::Int))
  putStrLn $ "sum " <> show n
  s <- numSparks
  putStrLn $ "Num sparks " <> show s

Relevant section from my .cabal file

  ghc-options:
    -threaded
    -rtsopts
    "-with-rtsopts=-N15 -qg1"

Platform information

$ stack --version
Version 1.2.0, Git revision 241cd07d576d9c0c0e712e83d947e3dd64541c42 (4054 commits) x86_64 hpack-0.14.0
$ stack exec ghc -- --version
The Glorious Glasgow Haskell Compilation System, version 7.10.3
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 16.04.1 LTS
Release:    16.04
Codename:   xenial
$ uname -r
4.4.0-36-generic

Why isn't my code getting parallelized?

EDIT: In case it's helpful, running with the -s RTS flag produces the following report:

  21,829,377,776 bytes allocated in the heap
 126,512,021,712 bytes copied during GC
      86,659,312 bytes maximum residency (322 sample(s))
       6,958,976 bytes maximum slop
             218 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     41944 colls,     0 par   16.268s  17.272s     0.0004s    0.0011s
  Gen  1       322 colls,   321 par   237.056s  23.822s     0.0740s    0.2514s

  Parallel GC work balance: 13.01% (serial 0%, perfect 100%)

  TASKS: 32 (1 bound, 31 peak workers (31 total), using -N15)

  SPARKS: 15 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 15 fizzled)

  INIT    time    0.004s  (  0.003s elapsed)
  MUT     time   12.504s  ( 13.301s elapsed)
  GC      time  253.324s  ( 41.094s elapsed)
  EXIT    time    0.000s  (  0.017s elapsed)
  Total   time  265.920s  ( 54.413s elapsed)

  Alloc rate    1,745,791,568 bytes per MUT second

  Productivity   4.7% of total user, 23.1% of total elapsed

gc_alloc_block_sync: 10725286
whitehole_spin: 0
gen[0].sync: 2171
gen[1].sync: 1057315

EDIT2: Tuning the allocation area size helped considerably: adding -H2G -A1G to the RTS options brought the time down from 43s to 5.2s. Is there anything else that can be improved to get a full 15x speedup?
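For reference, with these flags baked in via -with-rtsopts, the ghc-options section now looks like this (they could equally be passed on the command line at run time, since -rtsopts is enabled):

  ghc-options:
    -threaded
    -rtsopts
    "-with-rtsopts=-N15 -qg1 -H2G -A1G"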

EDIT3: Edited the code above to reflect the par/pseq pattern suggested in the feedback.

Upvotes: 3

Views: 655

Answers (1)

redneb

Reputation: 23850

The issue is caused by the definition of __pmap__. Specifically, the problem lies in the following expression:

let y = f x
 in y `par` (y: __pmap__ f xs)

You would expect this to cause y and y: __pmap__ f xs to be evaluated in parallel, but that is not what happens. GHC does try to evaluate them in parallel, but the second subexpression contains y, which is the first subexpression; since the second depends on the first, the two cannot actually be evaluated in parallel. The correct way to write the above expression is

let y = f x
    ys = __pmap__ f xs
 in y `par` (ys `pseq` (y : ys))

because the pseq forces ys to be evaluated before y : ys, so the evaluation of the second subexpression can start while y is still being evaluated. See also this thread for some discussion on this.

So putting it all together, we get the following:

-- imports as in the question, plus Debug.Trace (traceShow)
main :: IO ()
main = do
  let [one::Int, two] = [15, 1000000]
  print one
  print two
  let __pmap__ f xs = case xs of
        [] -> []
        x:xs -> let y = f x
                    ys = __pmap__ f xs
                 in y `par` ys `pseq` (y : ys)
  n <- pure $ sum . concat $ flip __pmap__ [1..one] $ \i ->
    traceShow i $ force $ sort $ HM.keys $ HM.fromList $ zip [i..(two + i)] (repeat (0::Int))
  putStrLn $ "sum " <> show n
  s <- numSparks
  putStrLn $ "Num sparks " <> show s

Notice that I've added a traceShow (from Debug.Trace). If you run this with -N1 in rtsopts, you will see the list being evaluated one element at a time, whereas with -N3 it will be evaluated 3 elements at a time.

The moral of the story is that par and pseq are easy to misuse, so you should prefer higher-level solutions such as parMap rdeepseq (which is equivalent to your __pmap__) from the parallel package.
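For illustration, here is a minimal self-contained sketch of the question's computation using parMap rdeepseq from Control.Parallel.Strategies (the binding name chunks is mine; otherwise the structure mirrors the question's code):

import Data.List (sort)
import Control.Parallel.Strategies (parMap, rdeepseq) -- from the parallel package
import qualified Data.HashMap.Lazy as HM

main :: IO ()
main = do
  let one = 15 :: Int
      two = 1000000 :: Int
  -- parMap rdeepseq sparks one fully forced element per list item,
  -- replacing both the hand-rolled __pmap__ and the explicit force
  let chunks = parMap rdeepseq
        (\i -> sort $ HM.keys $ HM.fromList $ zip [i .. two + i] (repeat (0 :: Int)))
        [1 .. one]
  putStrLn $ "sum " ++ show (sum (concat chunks))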

Upvotes: 1
