Carl
Carl

Reputation: 27023

Excessive mysterious system time use in a GHC-compiled binary

I'm working on an exploration of automatic bounding of constraint-base searches. As such, my starting point is the SEND MORE MONEY problem, with a solution based on nondeterministic selection without replacement. I've modified the approach to count the number of samples performed, in order to better measure the impact of adding constraints to the search.

import Control.Monad.State
import Control.Monad.Trans.List
import Control.Monad.Morph
import Data.List (foldl')

type CS a b = StateT [a] (ListT (State Int)) b

select' :: [a] -> [(a, [a])]
select' [] = []
select' (x:xs) = (x, xs) : [(y, x:ys) | ~(y, ys) <- select' xs]

select :: CS a a
select = do
    i <- lift . lift $ get
    xs <- get
    lift . lift . put $! i + length xs
    hoist (ListT . return) (StateT select')

runCS :: CS a b -> [a] -> ([b], Int)
runCS a xs = flip runState 0 . runListT $ evalStateT a xs

fromDigits :: [Int] -> Int
fromDigits = foldl' (\x y -> 10 * x + y) 0

sendMoreMoney :: ([(Int, Int, Int)], Int)
sendMoreMoney = flip runCS [0..9] $ do
    [s,e,n,d,m,o,r,y] <- replicateM 8 select
    let send  = fromDigits [s,e,n,d]
        more  = fromDigits [m,o,r,e]
        money = fromDigits [m,o,n,e,y]
    guard $ s /= 0 && m /= 0 && send + more == money
    return (send, more, money)

main :: IO ()
main = print sendMoreMoney

It works, it gets correct results, and it maintains a flat heap profile during the search. But even so, it's slow. It's something like 20x slower than without counting the selections. Even that isn't terrible. I can live with paying a huge penalty in order to collect these performance numbers.

But I still don't want the performance to be needlessly terrible, so I decided to look for low-hanging fruit in terms of performance. And I came across some baffling results when I did.

$ ghc -O2 -Wall -fforce-recomp -rtsopts statefulbacktrack.hs
[1 of 1] Compiling Main             ( statefulbacktrack.hs, statefulbacktrack.o )
Linking statefulbacktrack ...
$ time ./statefulbacktrack
([(9567,1085,10652)],2606500)

real    0m6.960s
user    0m3.880s
sys     0m2.968s

That system time is utterly ridiculous. The program performs output once. Where's it all going? My next step was checking strace.

$ strace -cf ./statefulbacktrack
([(9567,1085,10652)],2606500)
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 98.38    0.033798        1469        23           munmap
  1.08    0.000370           0     21273           rt_sigprocmask
  0.26    0.000090           0     10638           clock_gettime
  0.21    0.000073           0     10638           getrusage
  0.07    0.000023           4         6           mprotect
  0.00    0.000000           0         8           read
  0.00    0.000000           0         1           write
  0.00    0.000000           0       144       134 open
  0.00    0.000000           0        10           close
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         9         9 access
  0.00    0.000000           0         3           brk
  0.00    0.000000           0         1           ioctl
  0.00    0.000000           0       847           sigreturn
  0.00    0.000000           0         1           uname
  0.00    0.000000           0         1           select
  0.00    0.000000           0        13           rt_sigaction
  0.00    0.000000           0         1           getrlimit
  0.00    0.000000           0       387           mmap2
  0.00    0.000000           0        16        15 stat64
  0.00    0.000000           0        10           fstat64
  0.00    0.000000           0         1         1 futex
  0.00    0.000000           0         1           set_thread_area
  0.00    0.000000           0         1           set_tid_address
  0.00    0.000000           0         1           timer_create
  0.00    0.000000           0         2           timer_settime
  0.00    0.000000           0         1           timer_delete
  0.00    0.000000           0         1           set_robust_list
------ ----------- ----------- --------- --------- ----------------
100.00    0.034354                 44039       159 total

So.. strace tells me only 0.034354s was spent in system calls.

Where's the rest of the sys time reported by time going?

One further data point: GC time is really high. Is there an easy way to bring that down?

$ ./statefulbacktrack +RTS -s
([(9567,1085,10652)],2606500)
   5,541,572,660 bytes allocated in the heap
   1,465,208,164 bytes copied during GC
      27,317,868 bytes maximum residency (66 sample(s))
         635,056 bytes maximum slop
              65 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     10568 colls,     0 par    1.924s   2.658s     0.0003s    0.0081s
  Gen  1        66 colls,     0 par    0.696s   2.226s     0.0337s    0.1059s

  INIT    time    0.000s  (  0.001s elapsed)
  MUT     time    1.656s  (  2.279s elapsed)
  GC      time    2.620s  (  4.884s elapsed)
  EXIT    time    0.000s  (  0.009s elapsed)
  Total   time    4.276s  (  7.172s elapsed)

  %GC     time      61.3%  (68.1% elapsed)

  Alloc rate    3,346,131,972 bytes per MUT second

  Productivity  38.7% of total user, 23.1% of total elapsed

System Info:

$ ghc --version
The Glorious Glasgow Haskell Compilation System, version 7.10.1
$ uname -a
Linux debian 3.2.0-4-686-pae #1 SMP Debian 3.2.68-1+deb7u1 i686 GNU/Linux

Running a Debian 7 virtual machine in VMWare Player 7.10 hosted on Windows 8.1.

Upvotes: 33

Views: 423

Answers (1)

miniscule
miniscule

Reputation: 730

Be sure to add -H128 to your build command line after

+RTS -s

Your eval looks fine, so you are good to go there.

If you really wanted to go after sluggishness on this VM, increase the thread priority on the VM (and the VM console slightly if you want).

Another unexpected penalty will be due to sync confirmation for GC (since this is SMP Debian on a multicore system).

The GC will have even more VM manipulation to perform on any multicore system, which partially explains the 61 percent GC stat and your strace and time discrepancy. The stats are not reliable for most situations anyway

You are actually doing quite well - - especially if this is on an i7 or later, for example.

I would be surprised if the -H128 option does not resolve this.

I am new here, please do let me know if I can help further or if there is anything you require prior to doling out the bounty.

Upvotes: 5

Related Questions