Reputation: 12310
Consider the following toy program that brute-forces a password hash by applying character substitutions to dictionary words. The dictionary is traversed sequentially or in parallel, triggered at compile time by the PARMAP
symbol.
import qualified Control.Parallel.Strategies as Strat
import qualified Crypto.Hash.SHA256 as SHA256
import qualified Data.ByteString as BS
import qualified Data.ByteString.Base16 as BS.Base16
import qualified Data.ByteString.Char8 as BS.Char8
import Data.Char (isLower, toUpper)
import Data.Maybe (listToMaybe)
variants :: String -> [String]
variants "" = [""]
variants (c:s) = [c':s' | s' <- variants s, c' <- subst c]
where subst 'a' = "aA@"
subst 'e' = "eE3"
subst 'i' = "iI1"
subst 'l' = "lL1"
subst 'o' = "oO0"
subst 's' = "sS$5"
subst 'z' = "zZ2"
subst x | isLower x = [x, toUpper x]
subst x = [x]
myMap :: (a -> [a]) -> [a] -> [[a]]
#ifdef PARMAP
myMap = Strat.parMap (Strat.evalList Strat.rseq)
#else
myMap = map
#endif
bruteForce :: [String] -> BS.ByteString -> Maybe String
bruteForce dictionary hash = listToMaybe $ concat $ myMap match dictionary
where match word = [var | var <- variants word,
SHA256.hash (BS.Char8.pack var) == hash]
main :: IO ()
main = do
dictionary <- lines `fmap` (readFile "dictionary.txt")
hash <- (fst . BS.Base16.decode . BS.Char8.pack) `fmap` getLine
case bruteForce dictionary hash of
Just preimage -> putStrLn preimage
Nothing -> return ()
I compile this program with and without both PARMAP
and -threaded
.
$ ghc Brute.hs -cpp -fforce-recomp -Wall -O2 -o brute.seq
$ ghc Brute.hs -cpp -fforce-recomp -Wall -O2 -DPARMAP -o brute.par
$ ghc Brute.hs -cpp -fforce-recomp -Wall -O2 -threaded -o brute.seq+th
$ ghc Brute.hs -cpp -fforce-recomp -Wall -O2 -threaded -DPARMAP -o brute.par+th
To run this program, I make a small dictionary and take the last word from it.
$ shuf -n 300 /usr/share/dict/american-english >dictionary.txt
$ tail -n 1 dictionary.txt
desalinates
$ echo -n 'De$aL1n@teS' | sha256sum
3c76d2487f8094a8bf4cbb9e7de7624fab48c3d4f82112cb150b35b9a1cb9072 -
I run this on a 2-core CPU. No other CPU-intensive processes are running on this machine.
The sequential map version performs as expected.
$ TIMEFORMAT='real %R user %U sys %S'
$ time ./brute.seq <<<3c76d2487f8094a8bf4cbb9e7de7624fab48c3d4f82112cb150b35b9a1cb9072
De$aL1n@teS
real 39.797 user 39.574 sys 0.156
$ time ./brute.seq+th <<<3c76d2487f8094a8bf4cbb9e7de7624fab48c3d4f82112cb150b35b9a1cb9072
De$aL1n@teS
real 44.313 user 44.159 sys 0.088
$ time ./brute.seq+th +RTS -N <<<3c76d2487f8094a8bf4cbb9e7de7624fab48c3d4f82112cb150b35b9a1cb9072
De$aL1n@teS
real 44.990 user 44.835 sys 0.876
The parallel map version, compiled without -threaded
, has the same performance.
$ time ./brute.par <<<3c76d2487f8094a8bf4cbb9e7de7624fab48c3d4f82112cb150b35b9a1cb9072
De$aL1n@teS
real 39.847 user 39.742 sys 0.056
When I combine parallel map with -threaded
, but don’t use 2 cores just yet, things start looking strange.
$ time ./brute.par+th <<<3c76d2487f8094a8bf4cbb9e7de7624fab48c3d4f82112cb150b35b9a1cb9072
De$aL1n@teS
real 58.636 user 73.661 sys 17.717
When I use 2 cores, things get stranger yet. Performance now varies wildly from run to run, something the previous versions do not exhibit. Sometimes it is twice as fast as single-core brute.par+th
, which is what I expect, seeing as the algorithm is embarrasingly parallel. But sometimes it is even slower than on one core.
$ time ./brute.par+th +RTS -N <<<3c76d2487f8094a8bf4cbb9e7de7624fab48c3d4f82112cb150b35b9a1cb9072
De$aL1n@teS
real 28.589 user 51.615 sys 2.304
$ time ./brute.par+th +RTS -N <<<3c76d2487f8094a8bf4cbb9e7de7624fab48c3d4f82112cb150b35b9a1cb9072
De$aL1n@teS
real 59.149 user 106.255 sys 4.664
$ time ./brute.par+th +RTS -N <<<3c76d2487f8094a8bf4cbb9e7de7624fab48c3d4f82112cb150b35b9a1cb9072
De$aL1n@teS
real 49.428 user 87.193 sys 3.816
Now I have two questions which may be related.
brute.par+th
so much slower than 1-core brute.par
? What’s it doing in those threads? What’s it doing in kernel mode for a whopping 17 seconds?brute.par+th
vary so wildly, instead of being reliably twice the performance of 1-core brute.par+th
?I’m using GHC 7.4.1 and parallel-3.2.0.2. I know I should probably be using newer versions, but this is what I have handy at the moment.
I tried compiling with -rtsopts
and disabling threaded GC with +RTS -qg
, to no effect.
I tried ThreadScope, but it swapped like crazy and couldn’t load an eventlog even when I used a much smaller dictionary.
Upvotes: 3
Views: 151
Reputation: 143
The unexpected performance could be explained by the fact that "Crypto.Hash.SHA256" invokes unsafe FFI code. GHC does not guarantee that other Haskell threads will not be blocked during the call to this code. If the threads that par spawns are being blocked by GHC it would cause a lot of contention in your program, which leads to the long run times and inconsistent run time results.
Upvotes: 3