Why does this Haskell program perform strange when compiled with -threaded?

Question

Consider the following toy program that brute-forces a password hash by applying character substitutions to dictionary words. The dictionary is traversed sequentially or in parallel, triggered at compile time by the PARMAP symbol.

import qualified Control.Parallel.Strategies as Strat
import qualified Crypto.Hash.SHA256 as SHA256
import qualified Data.ByteString as BS
import qualified Data.ByteString.Base16 as BS.Base16
import qualified Data.ByteString.Char8 as BS.Char8
import Data.Char (isLower, toUpper)
import Data.Maybe (listToMaybe)

variants :: String -> [String]
variants "" = [""]
variants (c:s) = [c':s' | s' <- variants s, c' <- subst c]
  where subst 'a' = "aA@"
        subst 'e' = "eE3"
        subst 'i' = "iI1"
        subst 'l' = "lL1"
        subst 'o' = "oO0"
        subst 's' = "sS$5"
        subst 'z' = "zZ2"
        subst x | isLower x = [x, toUpper x]
        subst x = [x]

myMap :: (a -> [a]) -> [a] -> [[a]]
#ifdef PARMAP
myMap = Strat.parMap (Strat.evalList Strat.rseq)
#else
myMap = map
#endif

bruteForce :: [String] -> BS.ByteString -> Maybe String
bruteForce dictionary hash = listToMaybe $ concat $ myMap match dictionary
  where match word = [var | var <- variants word,
                      SHA256.hash (BS.Char8.pack var) == hash]

main :: IO ()
main = do
  dictionary <- lines `fmap` (readFile "dictionary.txt")
  hash <- (fst . BS.Base16.decode . BS.Char8.pack) `fmap` getLine
  case bruteForce dictionary hash of
    Just preimage -> putStrLn preimage
    Nothing -> return ()

I compile this program with and without both PARMAP and -threaded.

$ ghc Brute.hs -cpp -fforce-recomp -Wall -O2 -o brute.seq
$ ghc Brute.hs -cpp -fforce-recomp -Wall -O2 -DPARMAP -o brute.par
$ ghc Brute.hs -cpp -fforce-recomp -Wall -O2 -threaded -o brute.seq+th
$ ghc Brute.hs -cpp -fforce-recomp -Wall -O2 -threaded -DPARMAP -o brute.par+th

To run this program, I make a small dictionary and take the last word from it.

$ shuf -n 300 /usr/share/dict/american-english >dictionary.txt
$ tail -n 1 dictionary.txt 
desalinates
$ echo -n 'De$aL1n@teS' | sha256sum
3c76d2487f8094a8bf4cbb9e7de7624fab48c3d4f82112cb150b35b9a1cb9072  -

I run this on a 2-core CPU. No other CPU-intensive processes are running on this machine.

The sequential map version performs as expected.

$ TIMEFORMAT='real %R   user %U   sys %S'
$ time ./brute.seq <<<3c76d2487f8094a8bf4cbb9e7de7624fab48c3d4f82112cb150b35b9a1cb9072
De$aL1n@teS
real 39.797   user 39.574   sys 0.156
$ time ./brute.seq+th <<<3c76d2487f8094a8bf4cbb9e7de7624fab48c3d4f82112cb150b35b9a1cb9072
De$aL1n@teS
real 44.313   user 44.159   sys 0.088
$ time ./brute.seq+th +RTS -N <<<3c76d2487f8094a8bf4cbb9e7de7624fab48c3d4f82112cb150b35b9a1cb9072
De$aL1n@teS
real 44.990   user 44.835   sys 0.876

The parallel map version, compiled without -threaded, has the same performance.

$ time ./brute.par <<<3c76d2487f8094a8bf4cbb9e7de7624fab48c3d4f82112cb150b35b9a1cb9072
De$aL1n@teS
real 39.847   user 39.742   sys 0.056

When I combine parallel map with -threaded, but don’t use 2 cores just yet, things start looking strange.

$ time ./brute.par+th <<<3c76d2487f8094a8bf4cbb9e7de7624fab48c3d4f82112cb150b35b9a1cb9072
De$aL1n@teS
real 58.636   user 73.661   sys 17.717

When I use 2 cores, things get stranger yet. Performance now varies wildly from run to run, something the previous versions do not exhibit. Sometimes it is twice as fast as single-core brute.par+th, which is what I expect, seeing as the algorithm is embarrasingly parallel. But sometimes it is even slower than on one core.

$ time ./brute.par+th +RTS -N <<<3c76d2487f8094a8bf4cbb9e7de7624fab48c3d4f82112cb150b35b9a1cb9072
De$aL1n@teS
real 28.589   user 51.615   sys 2.304
$ time ./brute.par+th +RTS -N <<<3c76d2487f8094a8bf4cbb9e7de7624fab48c3d4f82112cb150b35b9a1cb9072
De$aL1n@teS
real 59.149   user 106.255   sys 4.664
$ time ./brute.par+th +RTS -N <<<3c76d2487f8094a8bf4cbb9e7de7624fab48c3d4f82112cb150b35b9a1cb9072
De$aL1n@teS
real 49.428   user 87.193   sys 3.816

Now I have two questions which may be related.

Why is 1-core brute.par+th so much slower than 1-core brute.par? What’s it doing in those threads? What’s it doing in kernel mode for a whopping 17 seconds?
Why does the performance of 2-core brute.par+th vary so wildly, instead of being reliably twice the performance of 1-core brute.par+th?

I’m using GHC 7.4.1 and parallel-3.2.0.2. I know I should probably be using newer versions, but this is what I have handy at the moment.

I tried compiling with -rtsopts and disabling threaded GC with +RTS -qg, to no effect.

I tried ThreadScope, but it swapped like crazy and couldn’t load an eventlog even when I used a much smaller dictionary.

SimpleGuy · Accepted Answer

The unexpected performance could be explained by the fact that "Crypto.Hash.SHA256" invokes unsafe FFI code. GHC does not guarantee that other Haskell threads will not be blocked during the call to this code. If the threads that par spawns are being blocked by GHC it would cause a lot of contention in your program, which leads to the long run times and inconsistent run time results.

Why does this Haskell program perform strange when compiled with -threaded?

Answers (1)

Related Questions