Haskell Vector performance compared to Scala

Question

I have a very simple piece of code in Haskell and Scala. It is intended to run in a very tight loop, so performance matters. The problem is that the Haskell version is about 10x slower than the Scala one. Here is the Haskell code.

{-# LANGUAGE BangPatterns #-}
import qualified Data.Vector.Unboxed as VU

newtype AffineTransform = AffineTransform {get :: (VU.Vector Double)} deriving (Show)

{-# INLINE runAffineTransform #-}
runAffineTransform :: AffineTransform -> (Double, Double) -> (Double, Double)
runAffineTransform affTr (!x, !y) = (get affTr `VU.unsafeIndex` 0 * x + get affTr `VU.unsafeIndex` 1 * y + get affTr `VU.unsafeIndex` 2, 
                                      get affTr `VU.unsafeIndex` 3 * x + get affTr `VU.unsafeIndex` 4 * y + get affTr `VU.unsafeIndex` 5)

testAffineTransformSpeed :: AffineTransform -> Int -> (Double, Double)
testAffineTransformSpeed affTr count = go count (0.5, 0.5)
  where go :: Int -> (Double, Double) -> (Double, Double)
        go 0 res = res
        go !n !res = go (n-1) (runAffineTransform affTr res)

What more can be done to improve this code?

Daniel Fischer · Accepted Answer

The main problem is that

runAffineTransform affTr (!x, !y) = (get affTr `VU.unsafeIndex` 0 * x
                                     + get affTr `VU.unsafeIndex` 1 * y
                                     + get affTr `VU.unsafeIndex` 2, 
                                       get affTr `VU.unsafeIndex` 3 * x
                                     + get affTr `VU.unsafeIndex` 4 * y
                                     + get affTr `VU.unsafeIndex` 5)

produces a pair of thunks. The components are not evaluated when runAffineTransform is called; they remain thunks until some consumer demands their values.

testAffineTransformSpeed affTr count = go count (0.5, 0.5)
  where go :: Int -> (Double, Double) -> (Double, Double)
        go 0 res = res
        go !n !res = go (n-1) (runAffineTransform affTr res)

is not that consumer: the bang on res only evaluates it to the outermost constructor, (,), and you get a result of

runAffineTransform affTr (runAffineTransform affTr (runAffineTransform affTr (...)))

which is evaluated only at the end, when finally the normal form is demanded.
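The difference between evaluating a pair to its outermost constructor and evaluating its components can be seen in a small experiment. This is a minimal sketch using only base; the names outerOnly, bothComponents and survives are hypothetical and do not appear in the question.

```haskell
{-# LANGUAGE BangPatterns #-}
import Control.Exception (SomeException, evaluate, try)

-- A bang on the whole pair forces only the (,) constructor (WHNF);
-- the components are never inspected.
outerOnly :: (Double, Double) -> Bool
outerOnly !_ = True

-- Bangs inside the tuple pattern force each component as well.
bothComponents :: (Double, Double) -> Bool
bothComponents (!_, !_) = True

-- Evaluates a Bool and reports False if doing so threw an exception.
survives :: Bool -> IO Bool
survives b = do
  r <- try (evaluate b) :: IO (Either SomeException Bool)
  return (either (const False) id r)
```

Here survives (outerOnly (undefined, undefined)) yields True, because the undefined components are never touched, while survives (bothComponents (undefined, undefined)) yields False. The go loop in the question behaves like outerOnly, which is why the components pile up unevaluated.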

If you force the components of the result to be evaluated immediately,

runAffineTransform affTr (!x, !y) = case
  (  get affTr `VU.unsafeIndex` 0 * x
   + get affTr `VU.unsafeIndex` 1 * y
   + get affTr `VU.unsafeIndex` 2
  ,  get affTr `VU.unsafeIndex` 3 * x
   + get affTr `VU.unsafeIndex` 4 * y
   + get affTr `VU.unsafeIndex` 5
  ) of (!a,!b) -> (a,b)

and let it be inlined, the main difference from jtobin's version, which uses a custom strict pair of unboxed Double#s, is that the loop in testAffineTransformSpeed runs one initial iteration with boxed Doubles as arguments and boxes the components of the result at the end. That adds a bit of constant overhead (around 5 nanoseconds per loop on my box). In both cases the main part of the loop takes an Int# and two Double# arguments, and the loop body is identical except for the boxing when n = 0 is reached.

Of course, forcing the immediate evaluation of the components by using an unboxed strict pair type is nicer.
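For illustration, a strict pair type along those lines might look like the following. This is a sketch under assumptions, not the code from the other answer: P2, Affine, applyAffine and iterateAffine are hypothetical names, and the six coefficients are held in a plain strict record rather than an unboxed vector so that the example depends only on base.

```haskell
{-# LANGUAGE BangPatterns #-}

-- Strict pair: the bangs make the constructor evaluate both fields, and
-- UNPACK asks GHC to store the raw doubles directly in the constructor.
data P2 = P2 {-# UNPACK #-} !Double {-# UNPACK #-} !Double
  deriving (Show, Eq)

-- Hypothetical stand-in for the six coefficients kept in the vector.
data Affine = Affine !Double !Double !Double !Double !Double !Double

-- Applies the transform; building the P2 result evaluates both components.
applyAffine :: Affine -> P2 -> P2
applyAffine (Affine a b c d e f) (P2 x y) =
  P2 (a * x + b * y + c) (d * x + e * y + f)
{-# INLINE applyAffine #-}

-- Same loop shape as testAffineTransformSpeed, but over the strict pair.
iterateAffine :: Affine -> Int -> P2
iterateAffine t count = go count (P2 0.5 0.5)
  where
    go :: Int -> P2 -> P2
    go 0 !res = res
    go n !res = go (n - 1) (applyAffine t res)
```

Because the fields are strict, evaluating a P2 to its constructor already evaluates both coordinates, so the bang on res in go is enough to keep the loop from accumulating thunks.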
