Reputation: 4326
The code below is just an experiment to see what happens when forkIO is called with a function that just reads and writes an MVar - to test GHC lightweight thread overhead, both memory overhead and the effect of contention for the MVar. I am aware there is going to be contention for MVar access when 1 million threads are waiting on it - but since each function accesses it only once, all of them should run through eventually. What I am seeing in my tests is that if I compile the code with the -threaded option (GHC 7.0.3, Mac, x86_64), it slows down a lot - by a factor of ~110. I would appreciate pointers to the cause of the slowdown with the -threaded option.
    {-# LANGUAGE BangPatterns #-}
    import Control.Concurrent (forkIO, MVar, newEmptyMVar, putMVar, takeMVar)
    import qualified Data.Vector.Storable.Mutable as MSV
    import Control.Monad.Primitive (PrimState)
    import Control.Monad (mapM_)

    f :: MVar Int -> IO ()
    f m = do
      !i <- takeMVar m
      putMVar m (i+1)

    main = do
      m <- newEmptyMVar
      putMVar m 0
      let l = 1000000
      mapM_ (\x -> forkIO $ f m) [1..l]
When compiled with ghc -O2 -rtsopts:
    ./test +RTS -s

      1,070,652,216 bytes allocated in the heap
      1,023,908,744 bytes copied during GC
              1,872 bytes maximum residency (1 sample(s))
            177,328 bytes maximum slop
                  9 MB total memory in use (0 MB lost due to fragmentation)

      Generation 0:  2029 collections,     0 parallel,  0.13s,  0.13s elapsed
      Generation 1:     1 collections,     0 parallel,  0.00s,  0.00s elapsed

      INIT  time    0.00s  (  0.00s elapsed)
      MUT   time    0.08s  (  0.08s elapsed)
      GC    time    0.13s  (  0.13s elapsed)
      EXIT  time    0.00s  (  0.00s elapsed)
      Total time    0.21s  (  0.22s elapsed)

      %GC time      61.9%  (61.7% elapsed)

      Alloc rate    13,110,134,156 bytes per MUT second

      Productivity  37.9% of total user, 37.2% of total elapsed
When compiled with ghc -O2 -rtsopts -threaded (I used -N3 on a quad-core iMac):
    ./test +RTS -s -N3

      1,096,608,080 bytes allocated in the heap
      2,713,129,232 bytes copied during GC
        761,160,288 bytes maximum residency (10 sample(s))
        711,798,176 bytes maximum slop
               2424 MB total memory in use (0 MB lost due to fragmentation)

      Generation 0:  1177 collections,  1176 parallel,  46.51s,  15.93s elapsed
      Generation 1:    10 collections,    10 parallel,   9.35s,   5.60s elapsed

      Parallel GC work balance: 1.05 (339027672 / 323162843, ideal 3)

                            MUT time (elapsed)       GC time  (elapsed)
      Task  0 (worker) :    0.00s    (  0.00s)       0.00s    (  0.00s)
      Task  1 (worker) :   56.95s    (  1.49s)       0.26s    (  0.09s)
      Task  2 (worker) :   57.05s    (  1.49s)       0.16s    (  0.05s)
      Task  3 (bound)  :    1.49s    (  1.23s)      55.27s    ( 21.33s)
      Task  4 (worker) :   57.20s    (  1.49s)       0.00s    (  0.00s)
      Task  5 (worker) :   57.03s    (  1.49s)       0.18s    (  0.06s)

      SPARKS: 0 (0 converted, 0 pruned)

      INIT  time    0.00s  (  0.00s elapsed)
      MUT   time    0.90s  (  1.23s elapsed)
      GC    time   55.86s  ( 21.53s elapsed)
      EXIT  time    0.45s  (  0.40s elapsed)
      Total time   57.21s  ( 23.02s elapsed)

      %GC time      97.6%  (93.5% elapsed)

      Alloc rate    811,808,581 bytes per MUT second

      Productivity   2.4% of total user, 5.9% of total elapsed

    gc_alloc_block_sync: 19789
    whitehole_spin: 73
    gen[0].sync_large_objects: 0
    gen[1].sync_large_objects: 0
As you can see above, the total memory in use goes up from ~9MB to ~2GB in -threaded mode, while the total memory allocated on the heap is within ~1% in both cases. I suspect most of the resident-memory overhead comes from forking the function, since each function instance on the heap has to be a thunk. Since I haven't used any par strategies, there are no sparks.

I wrote this code because I am curious about the behavior with and without -threaded mode, not because I plan to use it that way. Basically, I am teaching myself what happens if you write bad code like that. So, you don't have to tell me not to write it that way :)
EDIT 1:
ehird pointed out in a comment that I should check whether the threaded runtime itself contributes to the slowdown. It doesn't seem to - compiling with the -threaded option alone doesn't change the performance. The slowdown happens only when the +RTS -N option is used at runtime.
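(Editor's note: one quick way to see how much of the forked work actually completes before the program exits - this is a hedged sketch, not part of the original post - is to guard a counter with modifyMVar_ and read it just before main returns:)

```haskell
import Control.Concurrent (forkIO, newMVar, modifyMVar_, readMVar)
import Control.Monad (forM_)

main :: IO ()
main = do
  m <- newMVar (0 :: Int)
  forM_ [1 .. 1000000 :: Int] $ \_ ->
    forkIO $ modifyMVar_ m (\i -> return $! i + 1)
  -- main exits right after the forking loop; the count printed here is
  -- whatever the forked threads managed so far, typically below 1000000
  readMVar m >>= print
```

The printed count is nondeterministic, which itself demonstrates that main does not wait for the forked threads.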
Upvotes: 2
Views: 392
Reputation: 8153
The test program is NOT doing the work you expect. Once the main thread is done forking, it finishes and the whole program exits. Without the threaded runtime this happens quite quickly. With the threaded runtime, the scheduler may switch to the forked threads and perform some of the work, but not all of it before the main thread finishes.
You need to redesign the test if you want to make this microbenchmark work:
    {-# LANGUAGE BangPatterns #-}
    import Control.Concurrent (forkIO, MVar, newEmptyMVar, putMVar, takeMVar, threadDelay)
    import Control.Monad (mapM_)

    l = 100000

    f :: MVar Int -> IO ()
    f m = do
      !i <- takeMVar m
      putMVar m (i+1)
      if i==l then print "done" else return ()

    main = do
      m <- newEmptyMVar
      putMVar m 1
      mapM_ (\x -> forkIO $ f m) [1..l]
      -- threadDelay (1000*1000)
I changed the value of l a bit and started m at 1. It will only print "done" when the main thread delays long enough for the last thread to perform the last increment of i.

If you change print "done" into a program-exit command, then you can time the execution.
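A variant of that idea (a sketch with a second MVar added by the editor, not from the original answer): instead of guessing a threadDelay, main can block on a done MVar that the last worker fills, so the program runs exactly to completion and +RTS -s measures the full workload:

```haskell
{-# LANGUAGE BangPatterns #-}
import Control.Concurrent (forkIO, MVar, newEmptyMVar, putMVar, takeMVar)
import Control.Monad (forM_)

l :: Int
l = 100000

f :: MVar Int -> MVar () -> IO ()
f m done = do
  !i <- takeMVar m
  putMVar m (i + 1)
  -- the thread that takes the value l performs the final increment
  -- and signals completion
  if i == l then putMVar done () else return ()

main :: IO ()
main = do
  m    <- newEmptyMVar
  done <- newEmptyMVar
  putMVar m 1
  forM_ [1 .. l] $ \_ -> forkIO (f m done)
  -- block until the last increment has happened, then exit
  takeMVar done
```

Because main only returns after the counter has been incremented l times, the RTS statistics now reflect the work of all forked threads rather than however many happened to run before main exited.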
Upvotes: 4