Michael Fox

Reputation: 3642

Why does `-threaded` make it slower?

A simple program:

import qualified Data.ByteString.Lazy.Char8 as BS

-- Lazily read stdin, split it into words, and count them.
main :: IO ()
main = do
  wc <- length . BS.words <$> BS.getContents
  print wc

Build for speed:

ghc -fllvm -O2 -threaded -rtsopts Words.hs

More CPUs make it slower?

$ time ./Words +RTS -qa -N1 < big.txt
331041862

real  0m25.963s
user  0m21.747s
sys 0m1.528s
$ time ./Words +RTS -qa -N2 < big.txt
331041862

real  0m36.410s
user  0m34.910s
sys 0m6.892s
$ time ./Words +RTS -qa -N4 < big.txt
331041862

real  0m42.150s
user  0m55.393s
sys 0m16.227s

For good measure:

$ time wc -w big.txt
 331041862 big.txt

real    0m8.277s
user    0m7.553s
sys 0m0.529s

Clearly, this is a single-threaded activity. Still, I wonder why it slows down so much.

Also, do you have any tips on how I can make it competitive with wc?
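
A rough sketch of one possible direction (not benchmarked, so it may or may not beat the version above): fold over the strict chunks directly instead of materializing the whole word list. The isSpace8 helper below is ad hoc and only mirrors the ASCII whitespace that BS.words splits on, and the (count, inWord) accumulator is carried across chunk boundaries so a word split between two chunks is counted once.

{-# LANGUAGE BangPatterns #-}

import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as BL
import Data.List (foldl')
import Data.Word (Word8)

-- Ad hoc helper: the ASCII whitespace characters that BS.words splits on.
isSpace8 :: Word8 -> Bool
isSpace8 w = w == 32 || (w >= 9 && w <= 13)

main :: IO ()
main = do
  chunks <- BL.toChunks <$> BL.getContents
  -- Walk each strict chunk byte by byte, threading (count, inWord)
  -- through the whole stream so chunk boundaries don't split a word.
  let go (!n, !inWord) w
        | isSpace8 w = (n, False)
        | inWord     = (n, True)
        | otherwise  = (n + 1 :: Int, True)
      step acc chunk = B.foldl' go acc chunk
  print (fst (foldl' step (0, False) chunks))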

Upvotes: 3

Views: 221

Answers (1)

zakyggaps

Reputation: 3080

It's the GC. I ran your program with +RTS -s and the results tell the whole story.

-N1

D:\>a +RTS -qa -N1 -s < lorem.txt
15470835
   4,558,095,152 bytes allocated in the heap
       1,746,720 bytes copied during GC
          77,936 bytes maximum residency (118 sample(s))
         131,856 bytes maximum slop
               2 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      8519 colls,     0 par    0.016s   0.021s     0.0000s    0.0001s
  Gen  1       118 colls,     0 par    0.000s   0.004s     0.0000s    0.0001s

  TASKS: 3 (1 bound, 2 peak workers (2 total), using -N1)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.000s  (  0.001s elapsed)
  MUT     time    0.842s  (  0.855s elapsed)
  GC      time    0.016s  (  0.025s elapsed)
  EXIT    time    0.016s  (  0.000s elapsed)
  Total   time    0.874s  (  0.881s elapsed)

  Alloc rate    5,410,809,512 bytes per MUT second

  Productivity  98.2% of total user, 97.4% of total elapsed

gc_alloc_block_sync: 0
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0

-N4

D:\>a +RTS -qa -N4 -s < lorem.txt
15470835
   4,558,093,352 bytes allocated in the heap
       1,720,232 bytes copied during GC
          77,936 bytes maximum residency (113 sample(s))
         160,432 bytes maximum slop
               4 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      8524 colls,  8524 par    4.742s   1.678s     0.0002s    0.0499s
  Gen  1       113 colls,   112 par    0.031s   0.027s     0.0002s    0.0099s

  Parallel GC work balance: 1.40% (serial 0%, perfect 100%)

  TASKS: 6 (1 bound, 5 peak workers (5 total), using -N4)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.000s  (  0.001s elapsed)
  MUT     time    1.950s  (  1.415s elapsed)
  GC      time    4.774s  (  1.705s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time    6.724s  (  3.121s elapsed)

  Alloc rate    2,337,468,786 bytes per MUT second

  Productivity  29.0% of total user, 62.5% of total elapsed

gc_alloc_block_sync: 21082
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0

The most significant parts are

                                   Tot time (elapsed)  Avg pause  Max pause
Gen  0      8524 colls,  8524 par    4.742s   1.678s     0.0002s    0.0499s

and

Parallel GC work balance: 1.40% (serial 0%, perfect 100%)

When the -threaded switch is on, the GHC runtime tries its best to balance work across the available threads. Your program is a purely sequential process, so the only work that can be moved to other threads is garbage collection; but your program cannot in fact be collected in parallel effectively (hence the 1.40% work balance), so the GC threads end up waiting for one another to finish, and a lot of time is wasted on synchronization.

If you tell the runtime not to migrate threads between CPUs with +RTS -qm, then -N4 is sometimes as fast as -N1.
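
For example, an invocation combining it with the same -qa and -N4 flags used above would look something like this (invocation only, no timings):

$ ./Words +RTS -qa -qm -N4 < big.txt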

Upvotes: 8
