Reputation: 71
I have a very simple piece of code in Haskell and Scala. This code is intended to run in a very tight loop so performance matters. The problem is that Haskell is about 10x slower than Scala. Here it is Haskell code.
{-# LANGUAGE BangPatterns #-}
import qualified Data.Vector.Unboxed as VU
newtype AffineTransform = AffineTransform {get :: (VU.Vector Double)} deriving (Show)
{-# INLINE runAffineTransform #-}
runAffineTransform :: AffineTransform -> (Double, Double) -> (Double, Double)
runAffineTransform affTr (!x, !y) = (get affTr `VU.unsafeIndex` 0 * x + get affTr `VU.unsafeIndex` 1 * y + get affTr `VU.unsafeIndex` 2,
get affTr `VU.unsafeIndex` 3 * x + get affTr `VU.unsafeIndex` 4 * y + get affTr `VU.unsafeIndex` 5)
testAffineTransformSpeed :: AffineTransform -> Int -> (Double, Double)
testAffineTransformSpeed affTr count = go count (0.5, 0.5)
where go :: Int -> (Double, Double) -> (Double, Double)
go 0 res = res
go !n !res = go (n-1) (runAffineTransform affTr res)
What more can be done to improve this code?
Upvotes: 4
Views: 632
Reputation: 183978
The main problem is that
runAffineTransform affTr (!x, !y) = (get affTr `VU.unsafeIndex` 0 * x
+ get affTr `VU.unsafeIndex` 1 * y
+ get affTr `VU.unsafeIndex` 2,
get affTr `VU.unsafeIndex` 3 * x
+ get affTr `VU.unsafeIndex` 4 * y
+ get affTr `VU.unsafeIndex` 5)
produces a pair of thunks. The components are not evaluated when runAffineTransform
is called, they remain thunks until some consumer demands them to be evaluated.
testAffineTransformSpeed affTr count = go count (0.5, 0.5)
where go :: Int -> (Double, Double) -> (Double, Double)
go 0 res = res
go !n !res = go (n-1) (runAffineTransform affTr res)
is not that consumer, the bang on res
only evaluates it to the outermost constructor, (,)
, and you get a result of
runAffineTransform affTr (runAffineTrasform affTr (runAffineTransform affTr (...)))
which is evaluated only at the end, when finally the normal form is demanded.
If you force the components of the result to be evaluated immediately,
runAffineTransform affTr (!x, !y) = case
( get affTr `U.unsafeIndex` 0 * x
+ get affTr `U.unsafeIndex` 1 * y
+ get affTr `U.unsafeIndex` 2
, get affTr `U.unsafeIndex` 3 * x
+ get affTr `U.unsafeIndex` 4 * y
+ get affTr `U.unsafeIndex` 5
) of (!a,!b) -> (a,b)
and let it be inlined, the main difference to jtobin's version using a custom strict pair of unboxed Double#
s is that for the loop in testAffineTransformSpeed
you get one initial iteration using the boxed Double
s as argument, and at the end, the components of the result are boxed, which adds a bit of constant overhead (something around 5 nanoseconds per loop on my box). The main part of the loop takes an Int#
and two Double#
arguments in both cases and the loop body is identical except for the boxing when n = 0
is reached.
Of course, forcing the immediate evaluation of the components by using an unboxed strict pair type is nicer.
Upvotes: 8
Reputation: 3273
I defined the following strict/unboxed pair type:
import System.Random.MWC -- for later
import Control.DeepSeq
data SP = SP {
one :: {-# UNPACK #-} !Double
, two :: {-# UNPACK #-} !Double
} deriving Show
instance NFData SP where
rnf p = rnf (one p) `seq` rnf (two p) `seq` ()
and replaced it in the runAffineTransform
function:
runAffineTransform2 :: AffineTransform -> SP -> SP
runAffineTransform2 affTr !(SP x y) =
SP ( get affTr `U.unsafeIndex` 0 * x
+ get affTr `U.unsafeIndex` 1 * y
+ get affTr `U.unsafeIndex` 2 )
( get affTr `U.unsafeIndex` 3 * x
+ get affTr `U.unsafeIndex` 4 * y
+ get affTr `U.unsafeIndex` 5 )
{-# INLINE runAffineTransform2 #-}
then ran this benchmark suite:
main :: IO ()
main = do
g <- create
zs <- fmap (AffineTransform . U.fromList)
(replicateM 100000 (uniformR (0 :: Double, 1) g))
let myConfig = defaultConfig { cfgPerformGC = ljust True }
defaultMainWith myConfig (return ()) [
bench "yours" $ nf (testAffineTransformSpeed zs) 10
, bench "mine" $ nf (testAffineTransformSpeed2 zs) 10
]
Compiled with -O2
and ran, and observed some (~4x) speedup:
benchmarking yours
mean: 257.4559 ns, lb 256.2492 ns, ub 258.9761 ns, ci 0.950
std dev: 6.889905 ns, lb 5.688330 ns, ub 8.839753 ns, ci 0.950
found 5 outliers among 100 samples (5.0%)
3 (3.0%) high mild
2 (2.0%) high severe
variance introduced by outliers: 20.944%
variance is moderately inflated by outliers
benchmarking mine
mean: 69.56408 ns, lb 69.29910 ns, ub 69.86838 ns, ci 0.950
std dev: 1.448874 ns, lb 1.261444 ns, ub 1.718074 ns, ci 0.950
found 4 outliers among 100 samples (4.0%)
4 (4.0%) high mild
variance introduced by outliers: 14.190%
variance is moderately inflated by outliers
Full code is in a gist here.
EDIT
I also posted criterion's output report here.
Upvotes: 9