jamshidh

Reputation: 12070

In GHC, why is the lazy version of this small program so much faster than the loop-based variant?

These two programs do the same thing, but one runs 10x faster.

This takes approx. 10 seconds on my machine:

import Control.Monad
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as BL

theValueOne=B.singleton 1

main = replicateM_ 100000000 $ B.putStr theValueOne

The second version uses output-lazy IO. It is done in about 1 second (as fast as C):

import qualified Data.ByteString.Lazy as BL

main = BL.putStr $ BL.pack $ replicate 100000000 1

Question: Why is the non-lazy version so slow? More importantly, how can I make it fast? (I've tried recursion, forM, and changing the output buffering with hSetBuffering... nothing has made a difference.)


Note: This is more than just an academic question. The non-lazy version is an extremely simplified version of an executable my company uses in production, which is slow in the same way. It would be nearly impossible to re-architect the larger program around the analogous lazy solution.

Upvotes: 2

Views: 106

Answers (1)

K. A. Buhr

Reputation: 51159

Updated: added a possible source of the problem and a solution.

I don't think it has anything to do with lazy I/O. If you rewrite the strict I/O version to write two bytes at once:

theValueOne = B.singleton 1
main = replicateM_ 50000000 $ B.putStr (theValueOne <> theValueOne)

that halves the time. Write ten bytes at once:

theValueOne = B.singleton 1
main = replicateM_ 10000000 $ B.putStr (foldMap id (replicate 10 theValueOne))

and it's already faster than the lazy I/O version.

The issue is that there's significant overhead in a B.hPutStr call, much more than the overhead of a C fwrite call, and it's just not a particularly efficient way to write a single byte.
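If batching the output is an option at all, that per-call overhead can be amortized without any custom buffering machinery. Here's a minimal sketch of that idea (the 4096-byte chunk size is my own arbitrary choice, not something from the question):

```haskell
import Control.Monad (replicateM_)
import qualified Data.ByteString as B

main :: IO ()
main = do
  let total = 100000000
      chunk = 4096
      ones  = B.replicate chunk 1  -- one 4096-byte chunk of 1s, built once
  -- Write whole chunks, then whatever remainder is left over.
  replicateM_ (total `div` chunk) (B.putStr ones)
  B.putStr (B.replicate (total `mod` chunk) 1)
```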

A good chunk of the overhead comes from the fact that Haskell I/O buffers have immutable metadata. Even though the buffer content itself is mutable, the offsets that mark the valid data within the buffer are immutable, so writing a single byte requires heap-allocating a new GHC.IO.Buffer.Buffer structure, which GHC can't optimize away.
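For reference, the Buffer type in GHC.IO.Buffer looks roughly like this (a paraphrase; the exact field set varies across base versions). The point is that the start/end offsets live in an immutable record, so advancing them after a one-byte write means allocating a fresh record:

```haskell
import Data.Word (Word8)
import Foreign.ForeignPtr (ForeignPtr, mallocForeignPtrBytes)

-- Paraphrase of GHC.IO.Buffer.Buffer: the raw storage is mutable, but
-- bufL/bufR are plain fields of an immutable record, so moving them
-- requires constructing a whole new Buffer value.
data Buffer e = Buffer
  { bufRaw   :: !(ForeignPtr e)  -- mutable byte storage
  , bufState :: BufferState      -- reading or writing
  , bufSize  :: !Int             -- capacity
  , bufL     :: !Int             -- offset of first valid byte
  , bufR     :: !Int             -- offset past the last valid byte
  }

data BufferState = ReadBuffer | WriteBuffer
```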

One solution is to use a hand-crafted buffering structure with a mutable pointer. The following works, and it's about twice as fast as the lazy I/O version in the original question.

{-# LANGUAGE RecordWildCards #-}
{-# OPTIONS_GHC -Wall #-}

import Control.Monad
import Data.IORef
import Data.Word
import Foreign.ForeignPtr
import Foreign.Ptr
import Foreign.Storable
import System.IO

data WriteBuffer = WriteBuffer
  { handle :: !Handle
  , capacity :: !Int
  , used :: !(IORef Int)
  , content :: !(ForeignPtr Word8)
  }

newBuffer :: Handle -> IO WriteBuffer
newBuffer h = do
  hSetBinaryMode h True
  hSetBuffering h NoBuffering
  WriteBuffer h cap <$> newIORef 0 <*> mallocForeignPtrBytes cap
  where cap = 4096

flushBuffer :: WriteBuffer -> IO ()
flushBuffer WriteBuffer{..} = do
  n <- readIORef used
  withForeignPtr content $ \p -> hPutBuf handle p n
  writeIORef used 0

writeByte :: Word8 -> WriteBuffer -> IO ()
writeByte w buf@(WriteBuffer{..}) = do
  n <- readIORef used
  withForeignPtr content $ \p -> poke (plusPtr p n) w
  let n' = n + 1
  writeIORef used n'
  when (n' == capacity) $
    flushBuffer buf

main :: IO ()
main = do
  b <- newBuffer stdout
  replicateM_ 100000000 (writeByte 1 b)
  flushBuffer b

Somewhat ironically, converting this to a version that uses an immutable counter and threads the WriteBuffer as state through foldM doubles the speed again, making it about 4 times as fast as the lazy I/O version in the original question:

{-# LANGUAGE RecordWildCards #-}
{-# OPTIONS_GHC -Wall #-}

import Control.Monad
import Data.Word
import Foreign.ForeignPtr
import Foreign.Ptr
import Foreign.Storable
import System.IO

data WriteBuffer = WriteBuffer
  { handle :: !Handle
  , capacity :: !Int
  , used :: !Int
  , content :: !(ForeignPtr Word8)
  }

newBuffer :: Handle -> IO WriteBuffer
newBuffer h = do
  hSetBinaryMode h True
  hSetBuffering h NoBuffering
  WriteBuffer h cap 0 <$> mallocForeignPtrBytes cap
  where cap = 4096

flushBuffer :: WriteBuffer -> IO WriteBuffer
flushBuffer buf@WriteBuffer{..} = do
  withForeignPtr content $ \p -> hPutBuf handle p used
  return $ buf { used = 0 }

writeByte :: Word8 -> WriteBuffer -> IO WriteBuffer
writeByte w buf@(WriteBuffer{..}) = do
  withForeignPtr content $ \p -> poke (plusPtr p used) w
  let used' = used + 1
      buf' = buf { used = used' }
  if (used' == capacity)
    then flushBuffer buf'
    else return buf'

main :: IO ()
main = do
  b <- newBuffer stdout
  b' <- foldM (\s _ -> writeByte 1 s) b [(1::Int)..100000000]
  void (flushBuffer b')

The reason this one is so fast seems to be that GHC is able to optimize away the WriteBuffer constructor entirely from the fold and just pass around unboxed pointers and integers in the loop. My guess is that if I modified the mutable version above to avoid boxing and unboxing the integer in the used IORef, it would be similarly fast.
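One way to test that guess is to drop the record entirely and keep both counters as strict arguments of a local loop, which GHC can then unbox. This is my own sketch along those lines, not code from the answer above:

```haskell
{-# LANGUAGE BangPatterns #-}

import Data.Word (Word8)
import Foreign.ForeignPtr (mallocForeignPtrBytes, withForeignPtr)
import Foreign.Ptr (plusPtr)
import Foreign.Storable (poke)
import System.IO

-- The bytes-written count and the buffer-fill count are strict Int
-- arguments of a local loop, so GHC can keep them in registers instead
-- of reading an IORef (or a record field) on every byte.
main :: IO ()
main = do
  hSetBinaryMode stdout True
  hSetBuffering stdout NoBuffering
  let cap   = 4096
      total = 100000000 :: Int
  fp <- mallocForeignPtrBytes cap
  withForeignPtr fp $ \p -> do
    let go !i !n
          | i == total = hPutBuf stdout p n              -- final partial flush
          | n == cap   = hPutBuf stdout p cap >> go i 0  -- buffer full: flush
          | otherwise  = poke (plusPtr p n) (1 :: Word8) >> go (i + 1) (n + 1)
    go 0 (0 :: Int)
```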

Upvotes: 2
