Reputation: 12070
I need to prefill a mutable IOVector with a given value. The Haskell code I am using is:
-- use Control.Monad, Data.Vector.Unboxed.Mutable, Data.Word, and run in IO monad
buff <- new buffsize :: IO (IOVector Word8)
forM_ [0..buffsize-1] $ \p -> write buff p (100 :: Word8)
This runs 1-2 orders of magnitude slower than the comparable C code:
char *buff = (char *) malloc(BUFFERSIZE);
char *maxbuff = buff + BUFFERSIZE;
for(char *p = buff; p < maxbuff; p++) *p = 0;
For instance, for buffsize=4000000000, it takes about 7 seconds in C, but about 3 minutes in Haskell.
(FYI, I am using Ubuntu on an Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz with GHC 7.8.4, but these specifics probably shouldn't matter.)
Does anyone see any changes I could make to the Haskell code to get comparable speeds?
Upvotes: 1
Views: 164
Reputation: 48591
The most correct way to do this is to use set. Specifically, use slice to select the portion of the vector you wish to fill, then use set to fill it. If any other approach benchmarks faster (fairly unlikely), you should file a bug report.
See the top of No_signal's answer for an implementation of this idea.
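For concreteness, here is a minimal sketch of the slice-then-set idea (the helper name fillRange and the qualified import VM are illustrative, not part of the answer):

import qualified Data.Vector.Unboxed.Mutable as VM
import Data.Word (Word8)

-- Fill positions [off, off+len) of buff with x: slice picks the region, set fills it.
fillRange :: VM.IOVector Word8 -> Int -> Int -> Word8 -> IO ()
fillRange buff off len x = VM.set (VM.slice off len buff) x

main :: IO ()
main = do
    buff <- VM.new 4000000 :: IO (VM.IOVector Word8)
    fillRange buff 0 (VM.length buff) 100

When the whole vector is being filled, set buff 100 on its own is enough; slice is what lets you fill just a portion.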
Upvotes: 2
Reputation: 402
Edit: I will just put this here:
-- grow v by i elements and fill the new tail with x
-- (assumes: import qualified Data.Vector.Unboxed.Mutable as V)
growAndFill :: (Unbox a) => Int -> a -> IOVector a -> IO (IOVector a)
growAndFill i x v = do
    newV <- V.grow v i
    V.set (V.unsafeSlice (V.length v) i newV) x
    return newV
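A hypothetical usage sketch (the name demoGrowAndFill is illustrative, assuming Data.Word and the same qualified import V are in scope):

demoGrowAndFill :: IO ()
demoGrowAndFill = do
    v  <- V.replicate 10 (0 :: Word8)  -- existing vector of length 10
    v' <- growAndFill 5 100 v          -- length 15; the 5 new slots hold 100
    V.read v' 12 >>= print             -- prints 100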
Edit: go with slice then set over this (see the code above).
This should work and it is very fast. The code below is based on the set function:
-- note: the bang pattern on v requires {-# LANGUAGE BangPatterns #-}
unsafeFill :: (Unbox a) => IOVector a -> Int -> Int -> a -> IO ()
unsafeFill !v lo hi x
    | hi < lo   = return ()
    | otherwise = do
        unsafeWrite v lo x
        do_set 1
  where
    hi' = hi - lo + 1
    -- repeatedly double the filled prefix with unsafeCopy until it covers [lo, hi]
    do_set i
        | 2*i < hi' = do
            unsafeCopy (unsafeSlice (lo+i) i v) (unsafeSlice lo i v)
            do_set (2*i)
        | otherwise =
            unsafeCopy (unsafeSlice (lo+i) (hi'-i) v) (unsafeSlice lo (hi'-i) v)
fillBuff5 :: IOVector Word8 -> IO ()
fillBuff5 buff = unsafeFill buff 0 (buffsize-1) 100
Results, compiled with -O2:
benchmarking original
time 691.7 μs (689.6 μs .. 694.6 μs)
1.000 R² (1.000 R² .. 1.000 R²)
mean 691.2 μs (690.2 μs .. 692.6 μs)
std dev 3.848 μs (3.054 μs .. 5.367 μs)
benchmarking unsafeWrite
time 552.9 μs (551.7 μs .. 554.2 μs)
1.000 R² (1.000 R² .. 1.000 R²)
mean 552.9 μs (551.9 μs .. 554.1 μs)
std dev 3.485 μs (2.617 μs .. 5.655 μs)
benchmarking unsafeWrite + recursive
time 549.8 μs (546.1 μs .. 553.7 μs)
1.000 R² (1.000 R² .. 1.000 R²)
mean 554.4 μs (552.7 μs .. 556.3 μs)
std dev 6.096 μs (4.802 μs .. 7.698 μs)
benchmarking recursive
time 827.9 μs (823.9 μs .. 831.0 μs)
1.000 R² (1.000 R² .. 1.000 R²)
mean 817.9 μs (815.3 μs .. 821.5 μs)
std dev 10.43 μs (9.140 μs .. 12.49 μs)
benchmarking unsafeFill
time 65.13 μs (64.74 μs .. 65.57 μs)
1.000 R² (0.999 R² .. 1.000 R²)
mean 65.20 μs (64.88 μs .. 66.17 μs)
std dev 1.839 μs (731.5 ns .. 3.667 μs)
Upvotes: 1
Reputation: 27003
Three main things:
1. write is bounds-checked, while a direct pointer write in C isn't. Change it to unsafeWrite if you want to keep C's lack of safety.
2. forM_ [0..buffsize-1] has overhead due to being optimized for generality. If you want to remove all the generality, like the C loop, write the loop as something directly recursive.
3. Use the LLVM backend (-fllvm) for code that needs tight loops optimized.
I coded up a criterion benchmark to test a whole bunch of variants:
import Control.Monad
import Data.Vector.Unboxed.Mutable
import Data.Word
import Criterion.Main

buffsize :: Int
buffsize = 1000000

fillBuff1 :: IOVector Word8 -> IO ()
fillBuff1 buff = do
    forM_ [0..buffsize-1] $ \p -> write buff p 100

fillBuff2 :: IOVector Word8 -> IO ()
fillBuff2 buff = do
    forM_ [0..buffsize-1] $ \p -> unsafeWrite buff p 100

fillBuff3 :: IOVector Word8 -> IO ()
fillBuff3 buff = do
    let fill n | n < buffsize = unsafeWrite buff n 100 >> fill (n + 1)
               | otherwise    = return ()
    fill 0

fillBuff4 :: IOVector Word8 -> IO ()
fillBuff4 buff = do
    let fill n | n < buffsize = write buff n 100 >> fill (n + 1)
               | otherwise    = return ()
    fill 0

main = do
    buff <- new buffsize
    let b n f = bench n . whnfIO . f $ buff
    defaultMain [ b "original" fillBuff1
                , b "unsafeWrite" fillBuff2
                , b "unsafeWrite + recursive" fillBuff3
                , b "recursive" fillBuff4
                ]
Note that I'm only benchmarking filling, not allocation + filling.
Here's a typical session without llvm:
carl@debian:~/hask$ ghc --version
The Glorious Glasgow Haskell Compilation System, version 7.8.4
carl@debian:~/hask$ ghc -O2 mutvectorwrite
[1 of 1] Compiling Main ( mutvectorwrite.hs, mutvectorwrite.o )
Linking mutvectorwrite ...
carl@debian:~/hask$ ./mutvectorwrite
benchmarking original
time 6.659 ms (6.599 ms .. 6.728 ms)
0.999 R² (0.998 R² .. 1.000 R²)
mean 6.638 ms (6.599 ms .. 6.683 ms)
std dev 120.7 μs (97.36 μs .. 165.9 μs)
benchmarking unsafeWrite
time 5.413 ms (5.319 ms .. 5.524 ms)
0.998 R² (0.995 R² .. 0.999 R²)
mean 5.346 ms (5.309 ms .. 5.394 ms)
std dev 127.4 μs (85.00 μs .. 220.2 μs)
benchmarking unsafeWrite + recursive
time 3.363 ms (3.323 ms .. 3.409 ms)
0.999 R² (0.998 R² .. 0.999 R²)
mean 3.371 ms (3.343 ms .. 3.411 ms)
std dev 104.6 μs (65.11 μs .. 187.1 μs)
variance introduced by outliers: 16% (moderately inflated)
benchmarking recursive
time 3.389 ms (3.330 ms .. 3.438 ms)
0.998 R² (0.996 R² .. 1.000 R²)
mean 3.435 ms (3.424 ms .. 3.451 ms)
std dev 43.38 μs (34.49 μs .. 67.38 μs)
And a typical session with llvm:
carl@debian:~/hask$ ghc -O2 -fllvm mutvectorwrite
[1 of 1] Compiling Main ( mutvectorwrite.hs, mutvectorwrite.o )
Linking mutvectorwrite ...
carl@debian:~/hask$ ./mutvectorwrite
benchmarking original
time 5.302 ms (5.251 ms .. 5.365 ms)
0.999 R² (0.999 R² .. 1.000 R²)
mean 5.286 ms (5.262 ms .. 5.322 ms)
std dev 87.47 μs (63.29 μs .. 115.0 μs)
benchmarking unsafeWrite
time 3.929 ms (3.867 ms .. 4.001 ms)
0.998 R² (0.996 R² .. 0.999 R²)
mean 4.039 ms (3.994 ms .. 4.131 ms)
std dev 204.2 μs (114.6 μs .. 378.5 μs)
variance introduced by outliers: 30% (moderately inflated)
benchmarking unsafeWrite + recursive
time 496.4 μs (492.8 μs .. 500.8 μs)
0.999 R² (0.998 R² .. 1.000 R²)
mean 496.6 μs (492.8 μs .. 503.9 μs)
std dev 17.46 μs (9.971 μs .. 31.42 μs)
variance introduced by outliers: 27% (moderately inflated)
benchmarking recursive
time 556.6 μs (548.4 μs .. 563.8 μs)
0.998 R² (0.996 R² .. 0.999 R²)
mean 565.4 μs (559.7 μs .. 574.3 μs)
std dev 23.95 μs (16.41 μs .. 33.78 μs)
variance introduced by outliers: 35% (moderately inflated)
Performance gets down to a pretty reasonable level when you combine everything.
Upvotes: 4