Reputation: 4326
The code below is just an experiment to see what happens when forkIO is called with a function that just reads and writes an MVar - to test GHC lightweight thread overhead, both memory overhead and the effect of contention for the MVar. I am aware there is going to be contention for MVar access when 1 million threads are waiting on it - but since each function accesses it only once, all of them should run through eventually. What I am seeing in my tests is that if I compile the code with the -threaded option (GHC 7.0.3, Mac, x86_64), it slows down a lot - by a factor of ~110. I would appreciate pointers to the cause of the slowdown with the -threaded option.
    {-# LANGUAGE BangPatterns #-}
    import Control.Concurrent (forkIO, MVar, newEmptyMVar, putMVar, takeMVar)
    import qualified Data.Vector.Storable.Mutable as MSV
    import Control.Monad.Primitive (PrimState)
    import Control.Monad (mapM_)

    f :: MVar Int -> IO ()
    f m = do
      !i <- takeMVar m
      putMVar m (i+1)

    main = do
      m <- newEmptyMVar
      putMVar m 0
      let l = 1000000
      mapM_ (\x -> forkIO $ f m) [1..l]
When compiled with ghc -O2 -rtsopts:
    ./test +RTS -s

      1,070,652,216 bytes allocated in the heap
      1,023,908,744 bytes copied during GC
              1,872 bytes maximum residency (1 sample(s))
            177,328 bytes maximum slop
                  9 MB total memory in use (0 MB lost due to fragmentation)

      Generation 0:  2029 collections,     0 parallel,  0.13s,  0.13s elapsed
      Generation 1:     1 collections,     0 parallel,  0.00s,  0.00s elapsed

      INIT  time    0.00s  (  0.00s elapsed)
      MUT   time    0.08s  (  0.08s elapsed)
      GC    time    0.13s  (  0.13s elapsed)
      EXIT  time    0.00s  (  0.00s elapsed)
      Total time    0.21s  (  0.22s elapsed)

      %GC time      61.9%  (61.7% elapsed)

      Alloc rate    13,110,134,156 bytes per MUT second

      Productivity  37.9% of total user, 37.2% of total elapsed
When compiled with ghc -O2 -rtsopts -threaded (I used -N3 on a quad-core iMac):
    ./test +RTS -s -N3

      1,096,608,080 bytes allocated in the heap
      2,713,129,232 bytes copied during GC
        761,160,288 bytes maximum residency (10 sample(s))
        711,798,176 bytes maximum slop
               2424 MB total memory in use (0 MB lost due to fragmentation)

      Generation 0:  1177 collections,  1176 parallel,  46.51s,  15.93s elapsed
      Generation 1:    10 collections,    10 parallel,   9.35s,   5.60s elapsed

      Parallel GC work balance: 1.05 (339027672 / 323162843, ideal 3)

                            MUT time (elapsed)       GC time  (elapsed)
      Task  0 (worker) :    0.00s    (  0.00s)       0.00s    (  0.00s)
      Task  1 (worker) :   56.95s    (  1.49s)       0.26s    (  0.09s)
      Task  2 (worker) :   57.05s    (  1.49s)       0.16s    (  0.05s)
      Task  3 (bound)  :    1.49s    (  1.23s)      55.27s    ( 21.33s)
      Task  4 (worker) :   57.20s    (  1.49s)       0.00s    (  0.00s)
      Task  5 (worker) :   57.03s    (  1.49s)       0.18s    (  0.06s)

      SPARKS: 0 (0 converted, 0 pruned)

      INIT  time    0.00s  (  0.00s elapsed)
      MUT   time    0.90s  (  1.23s elapsed)
      GC    time   55.86s  ( 21.53s elapsed)
      EXIT  time    0.45s  (  0.40s elapsed)
      Total time   57.21s  ( 23.02s elapsed)

      %GC time      97.6%  (93.5% elapsed)

      Alloc rate    811,808,581 bytes per MUT second

      Productivity   2.4% of total user, 5.9% of total elapsed

    gc_alloc_block_sync: 19789
    whitehole_spin: 73
    gen[0].sync_large_objects: 0
    gen[1].sync_large_objects: 0
As you can see above, the total memory in use goes up from ~9MB to ~2GB in -threaded mode, while the total memory allocated on the heap is within ~1% in both cases. I suspect most of the resident-memory overhead comes from forking the function, since each function instance on the heap has to be a thunk. Since I haven't used any par strategies, there are no sparks.

I wrote this code because I am curious about the behavior with and without -threaded mode, not because I plan to use it that way. Basically, I am teaching myself what happens if you write bad code like that. So, you don't have to tell me not to write it that way :)
EDIT 1:
ehird pointed out in a comment that I should check whether the threaded runtime itself contributes to the slowdown. It doesn't seem to - compiling with the -threaded option alone doesn't change the performance. The slowdown happens only when the +RTS -N option is used at runtime.
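(Editor's note: one quick way to see how much of the forked work actually completes before the program exits - this is a hedged sketch, not part of the original post - is to guard a counter with modifyMVar_ and read it just before main returns:)

```haskell
import Control.Concurrent (forkIO, newMVar, modifyMVar_, readMVar)
import Control.Monad (forM_)

main :: IO ()
main = do
  m <- newMVar (0 :: Int)
  forM_ [1 .. 1000000 :: Int] $ \_ ->
    forkIO $ modifyMVar_ m (\i -> return $! i + 1)
  -- main exits right after the forking loop; the count printed here is
  -- whatever the forked threads managed so far, typically below 1000000
  readMVar m >>= print
```

The printed count is nondeterministic, which itself demonstrates that main does not wait for the forked threads.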
Upvotes: 2
Views: 392
Reputation: 8153
The test program is NOT doing the work you expect. Once the main thread is done forking, it finishes and the whole program exits. Without the threaded runtime this happens quite quickly. With the threaded runtime, the scheduler may switch to the forked threads and perform some of the work, but not all of it before the main thread finishes.
You need to redesign the test if you want to make this microbenchmark work:
    {-# LANGUAGE BangPatterns #-}
    import Control.Concurrent (forkIO, MVar, newEmptyMVar, putMVar, takeMVar, threadDelay)
    import Control.Monad (mapM_)

    l = 100000

    f :: MVar Int -> IO ()
    f m = do
      !i <- takeMVar m
      putMVar m (i+1)
      if i==l then print "done" else return ()

    main = do
      m <- newEmptyMVar
      putMVar m 1
      mapM_ (\x -> forkIO $ f m) [1..l]
      -- threadDelay (1000*1000)
I changed the value of l a bit and started m at 1. It will only print "done" when the main thread delays long enough for the last thread to perform the last increment of i.

If you change print "done" into a program-exit command, then you can time the execution.
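A variant of that idea (a sketch with a second MVar added by the editor, not from the original answer): instead of guessing a threadDelay, main can block on a done MVar that the last worker fills, so the program runs exactly to completion and +RTS -s measures the full workload:

```haskell
{-# LANGUAGE BangPatterns #-}
import Control.Concurrent (forkIO, MVar, newEmptyMVar, putMVar, takeMVar)
import Control.Monad (forM_)

l :: Int
l = 100000

f :: MVar Int -> MVar () -> IO ()
f m done = do
  !i <- takeMVar m
  putMVar m (i + 1)
  -- the thread that takes the value l performs the final increment
  -- and signals completion
  if i == l then putMVar done () else return ()

main :: IO ()
main = do
  m    <- newEmptyMVar
  done <- newEmptyMVar
  putMVar m 1
  forM_ [1 .. l] $ \_ -> forkIO (f m done)
  -- block until the last increment has happened, then exit
  takeMVar done
```

Because main only returns after the counter has been incremented l times, the RTS statistics now reflect the work of all forked threads rather than however many happened to run before main exited.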
Upvotes: 4