jamshidh

Reputation: 12070

How to get C-like performance when filling up a mutable buffer in GHC Haskell

I need to pre-fill a mutable IOVector with a given value. The Haskell code I am using is

-- use Control.Monad, Data.Vector.Unboxed.Mutable, Data.Word, and run in IO monad
buff <- new buffsize :: IO (IOVector Word8)
forM_ [0..buffsize-1] $ \p -> write buff p (100 :: Word8)

This runs 1-2 orders of magnitude slower than the comparable C code:

char *buff = (char *) malloc(BUFFERSIZE);
char *maxbuff = buff + BUFFERSIZE;

for(char *p = buff; p < maxbuff; p++) *p = 0;

For instance, with buffsize = 4000000000, the C code takes about 7 seconds, but the Haskell takes about 3 minutes.

(FYI, I am using Ubuntu on an Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz with GHC version 7.8.4, but these specifics probably shouldn't matter.)

Does anyone see any changes I could make to the Haskell code to get comparable speeds?

Upvotes: 1

Views: 164

Answers (3)

dfeuer

Reputation: 48591

The most correct way to do this is to use set. Specifically, use slice to select the portion of the vector you wish to fill, then use set to fill it. If any other approach benchmarks faster (fairly unlikely), you should file a bug report.

Edit

See the top of No_signal's answer for an implementation of this idea.
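For the whole-buffer case in the question, a minimal sketch of this idea might look like the following (fillWithSet is just an illustrative name, not something from the question or the vector library):

import Data.Vector.Unboxed.Mutable (IOVector, new, set, slice)
import Data.Word (Word8)

fillWithSet :: Int -> IO (IOVector Word8)
fillWithSet buffsize = do
    buff <- new buffsize
    -- fill the whole buffer; to fill only a part of it, select that part
    -- with slice first, e.g. set (slice offset len buff) 100
    set buff 100
    return buff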

Upvotes: 2

No_signal

Reputation: 402

Edit: I will just put this here:

-- assumes: import qualified Data.Vector.Unboxed.Mutable as V
growAndFill :: (Unbox a) => Int -> a -> IOVector a -> IO (IOVector a)
growAndFill i x v = do
    newV <- V.grow v i
    V.set (V.unsafeSlice (V.length v) i newV) x
    return newV
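
A hypothetical usage sketch for the question's scenario (fillFresh is just an illustrative name; it assumes the same imports as the snippet above):

fillFresh :: Int -> IO (IOVector Word8)
fillFresh buffsize = do
    v0 <- V.new 0                   -- start from an empty vector
    growAndFill buffsize 100 v0     -- grow by buffsize and fill the new part with 100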

Edit: prefer the slice-then-set approach above over the code below.

This should work and it is very fast.

The code below is based on the implementation of the set function:

-- needs BangPatterns for the strict pattern on v;
-- writes one element, then repeatedly doubles the filled region with copies
unsafeFill :: (Unbox a) => IOVector a -> Int -> Int -> a -> IO ()
unsafeFill !v lo hi x
    | hi < lo   = return ()
    | otherwise = do
        unsafeWrite v lo x
        do_set 1
  where
    hi' = hi - lo + 1
    do_set i
        | 2*i < hi' = do
            unsafeCopy (unsafeSlice (lo+i) i v) (unsafeSlice lo i v)
            do_set (2*i)
        | otherwise =
            unsafeCopy (unsafeSlice (lo+i) (hi'-i) v) (unsafeSlice lo (hi'-i) v)

fillBuff5 :: IOVector Word8 -> IO ()
fillBuff5 buff = unsafeFill buff 0 (buffsize-1) 100

Results, compiled with -O2:

benchmarking original
time                 691.7 μs   (689.6 μs .. 694.6 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 691.2 μs   (690.2 μs .. 692.6 μs)
std dev              3.848 μs   (3.054 μs .. 5.367 μs)

benchmarking unsafeWrite
time                 552.9 μs   (551.7 μs .. 554.2 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 552.9 μs   (551.9 μs .. 554.1 μs)
std dev              3.485 μs   (2.617 μs .. 5.655 μs)

benchmarking unsafeWrite + recursive
time                 549.8 μs   (546.1 μs .. 553.7 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 554.4 μs   (552.7 μs .. 556.3 μs)
std dev              6.096 μs   (4.802 μs .. 7.698 μs)

benchmarking recursive
time                 827.9 μs   (823.9 μs .. 831.0 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 817.9 μs   (815.3 μs .. 821.5 μs)
std dev              10.43 μs   (9.140 μs .. 12.49 μs)

benchmarking unsafeFill
time                 65.13 μs   (64.74 μs .. 65.57 μs)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 65.20 μs   (64.88 μs .. 66.17 μs)
std dev              1.839 μs   (731.5 ns .. 3.667 μs)

Upvotes: 1

Carl

Reputation: 27003

Three main things:

  1. write is bounds-checked, while a direct pointer write in C isn't. Change that to unsafeWrite if you want to keep C's lack of safety.

  2. forM_ [0..buffsize-1] has overhead because it is optimized for generality. To strip that generality away, as the C loop does, write the loop as a directly recursive function.

  3. Use the LLVM backend (-fllvm) for code that needs tight loops optimized well.

I coded up a criterion benchmark to test a whole bunch of variants:

import Control.Monad
import Data.Vector.Unboxed.Mutable
import Data.Word

import Criterion.Main

buffsize :: Int
buffsize = 1000000

fillBuff1 :: IOVector Word8 -> IO ()
fillBuff1 buff = do
    forM_ [0..buffsize-1] $ \p -> write buff p 100

fillBuff2 :: IOVector Word8 -> IO ()
fillBuff2 buff = do
    forM_ [0..buffsize-1] $ \p -> unsafeWrite buff p 100

fillBuff3 :: IOVector Word8 -> IO ()
fillBuff3 buff = do
    let fill n | n < buffsize = unsafeWrite buff n 100 >> fill (n + 1)
               | otherwise    = return ()
    fill 0

fillBuff4 :: IOVector Word8 -> IO ()
fillBuff4 buff = do
    let fill n | n < buffsize = write buff n 100 >> fill (n + 1)
               | otherwise    = return ()
    fill 0

main = do
    buff <- new buffsize
    let b n f = bench n . whnfIO . f $ buff
    defaultMain [ b "original" fillBuff1
                , b "unsafeWrite" fillBuff2
                , b "unsafeWrite + recursive" fillBuff3
                , b "recursive" fillBuff4
                ]

Note that I'm only benchmarking filling, not allocation + filling.

Here's a typical session without LLVM:

carl@debian:~/hask$ ghc --version
The Glorious Glasgow Haskell Compilation System, version 7.8.4
carl@debian:~/hask$ ghc -O2 mutvectorwrite
[1 of 1] Compiling Main             ( mutvectorwrite.hs, mutvectorwrite.o )
Linking mutvectorwrite ...
carl@debian:~/hask$ ./mutvectorwrite 
benchmarking original
time                 6.659 ms   (6.599 ms .. 6.728 ms)
                     0.999 R²   (0.998 R² .. 1.000 R²)
mean                 6.638 ms   (6.599 ms .. 6.683 ms)
std dev              120.7 μs   (97.36 μs .. 165.9 μs)

benchmarking unsafeWrite
time                 5.413 ms   (5.319 ms .. 5.524 ms)
                     0.998 R²   (0.995 R² .. 0.999 R²)
mean                 5.346 ms   (5.309 ms .. 5.394 ms)
std dev              127.4 μs   (85.00 μs .. 220.2 μs)

benchmarking unsafeWrite + recursive
time                 3.363 ms   (3.323 ms .. 3.409 ms)
                     0.999 R²   (0.998 R² .. 0.999 R²)
mean                 3.371 ms   (3.343 ms .. 3.411 ms)
std dev              104.6 μs   (65.11 μs .. 187.1 μs)
variance introduced by outliers: 16% (moderately inflated)

benchmarking recursive
time                 3.389 ms   (3.330 ms .. 3.438 ms)
                     0.998 R²   (0.996 R² .. 1.000 R²)
mean                 3.435 ms   (3.424 ms .. 3.451 ms)
std dev              43.38 μs   (34.49 μs .. 67.38 μs)

And a typical session with LLVM:

carl@debian:~/hask$ ghc -O2 -fllvm mutvectorwrite
[1 of 1] Compiling Main             ( mutvectorwrite.hs, mutvectorwrite.o )
Linking mutvectorwrite ...
carl@debian:~/hask$ ./mutvectorwrite 
benchmarking original
time                 5.302 ms   (5.251 ms .. 5.365 ms)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 5.286 ms   (5.262 ms .. 5.322 ms)
std dev              87.47 μs   (63.29 μs .. 115.0 μs)

benchmarking unsafeWrite
time                 3.929 ms   (3.867 ms .. 4.001 ms)
                     0.998 R²   (0.996 R² .. 0.999 R²)
mean                 4.039 ms   (3.994 ms .. 4.131 ms)
std dev              204.2 μs   (114.6 μs .. 378.5 μs)
variance introduced by outliers: 30% (moderately inflated)

benchmarking unsafeWrite + recursive
time                 496.4 μs   (492.8 μs .. 500.8 μs)
                     0.999 R²   (0.998 R² .. 1.000 R²)
mean                 496.6 μs   (492.8 μs .. 503.9 μs)
std dev              17.46 μs   (9.971 μs .. 31.42 μs)
variance introduced by outliers: 27% (moderately inflated)

benchmarking recursive
time                 556.6 μs   (548.4 μs .. 563.8 μs)
                     0.998 R²   (0.996 R² .. 0.999 R²)
mean                 565.4 μs   (559.7 μs .. 574.3 μs)
std dev              23.95 μs   (16.41 μs .. 33.78 μs)
variance introduced by outliers: 35% (moderately inflated)

Performance gets down to a pretty reasonable level when you combine everything.

Upvotes: 4
