Dmitry Galchinsky

Reputation: 2201

Why Vector code is slower than standard list one

I have read that the Vector library uses modern optimization techniques, and I tried to compare its performance with that of lists. The code below generates some sound-like data (which matters in my subject area) and sums the result:

import System.Environment (getArgs)
import System.TimeIt
import Data.List
import Data.Vector.Unboxed as V

x1 :: Int -> [Double]
x1 n = [1..(fromIntegral n)]

x2 :: Int -> V.Vector Double
x2 n = V.enumFromN 1 n

osc1 f = Prelude.map (\x -> sin(2*pi*x*f/44100.0))
osc2 f = V.map (\x -> sin(2*pi*x*f/44100.0))

sum1 = Data.List.foldl1' (+)
sum2 = V.foldl1' (+)

zip1 = Prelude.zipWith (+)
zip2 = V.zipWith (+)

main = do s <- getArgs
          let n = read (s !! 0) :: Int
          print "Prelude version"
          timeIt $ print $ sum1 $ zip1 (osc1 55.5 (x1 n)) (osc1 110.0 $ x1 n)
          print "Vector version"
          timeIt $ print $ sum2 $ zip2 (osc2 55.5 (x2 n)) (osc2 110.0 $ x2 n)

GHC 7.6.3 running on Win7 with vector-0.10.0.1 and timeit-1.0.0.0 gave me these results:

c:\coding>test 10000000
"Prelude version"
90.98579564908658
CPU time:   9.92s
"Vector version"
90.98579564908658
CPU time:  11.03s

The Vector version is a little slower even though it is unboxed, and the boxed Vector version takes 22.67 seconds. Why does this happen? How should I write this code to get maximum performance?

UPD. After adding -O2 (**) I got results that make more sense to me. It looks like boxed vectors are harder to fuse.

                  List    Vector.Unboxed    Vector
ghc test.hs       9.78    10.94             21.95
ghc test.hs -O2   3.39    1.25              7.57

(**) I didn't notice this at first because GHC doesn't recompile unchanged files even when the command-line flags differ, so I hadn't actually run the -O2 version before. Sorry.

Upvotes: 4

Views: 652

Answers (4)

Heatsink

Reputation: 7751

When compiled with optimizations enabled, the Vector version is faster: the compiler inlines and specializes the vector functions, eliminating many function calls and boxed temporary values.

Switching to streams gives you another 1.5× improvement by fusing all the steps of computation into a single loop. No arrays are built.

import System.Environment (getArgs)
import System.TimeIt
import Data.Vector.Fusion.Stream as S

x3 :: Int -> S.Stream Double
x3 n = S.enumFromStepN 1 1 n
osc3 f = S.map (\x -> sin(2*pi*x*f/44100.0))
sum3 = S.foldl1' (+)
zip_3 = S.zipWith (+)

main = do s <- getArgs
          let n = read (s !! 0) :: Int
          print "Stream version"
          timeIt $ print $ sum3 $ zip_3 (osc3 55.5 (x3 n)) (osc3 110.0 $ x3 n)

Stream fusion on Vectors won't fuse the inputs of zipWith, so the vector code doesn't optimize in the same way.
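To illustrate the point about zipWith: a single-source vector pipeline does fuse completely, so one way to get the fused loop without dropping down to Streams is to avoid zipWith altogether. This is a sketch (the pipeline below computes one oscillator rather than the sum of two, just to show the fusable shape):

```haskell
import qualified Data.Vector.Unboxed as V

-- A single-source pipeline: enumFromN, map and sum fuse into one loop
-- under -O2, so no intermediate vector is allocated.
sumOsc :: Double -> Int -> Double
sumOsc f n = V.sum $ V.map (\x -> sin (2*pi*x*f/44100)) $ V.enumFromN 1 n

main :: IO ()
main = print (sumOsc 55.5 1000000)
```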

Compiling with -O2, the Prelude version is slowest and the Stream version is fastest.

$ ./Test 10000000
"Prelude version"
90.98579565011536
3.051188s
"Vector version"
90.98579565011924
1.81228s
"Stream version"
90.98579565011907
1.155345s

Upvotes: 2

MdxBhmt

Reputation: 1310

It's a matter of optimization flags:

-O0

>test 10000000
"Prelude version"
90.98579564908658
CPU time:   6.66s
"Vector version"
90.98579564908658
CPU time:   8.27s

-O1

>test 10000000
"Prelude version"
90.98579565011536
CPU time:   2.70s
"Vector version"
90.98579565011924
CPU time:   1.62s

-O2

>test 10000000
"Prelude version"
90.98579565011536
CPU time:   2.72s
"Vector version"
90.98579565011924
CPU time:   1.34s

From Haskell tag info:

Performance issues

In case of performance issues, please make sure that you compile your code with optimizations enabled. Passing -O2 makes many performance issues go away.

update

For a quick explanation of why Unboxed is faster, here's one (from the vector documentation):

The most flexible type is Data.Vector.Vector, which provides boxed arrays: arrays of pointers to Haskell values.

These arrays are suitable for storing complex Haskell types (sum types, or algebraic data types), but a better choice for simple data types is Data.Vector.Unboxed.

For Unboxed:

Simple, atomic types, and pair types can be stored in a more efficient manner: consecutive memory slots without pointers.
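A minimal side-by-side sketch of the two representations (the function names here are mine, just for illustration): the computation is identical, but the unboxed version stores the Doubles directly in a flat array instead of an array of pointers to heap-allocated Doubles.

```haskell
import qualified Data.Vector as B          -- boxed: array of pointers to values
import qualified Data.Vector.Unboxed as U  -- unboxed: flat array of Doubles

-- Same strict sum over 1..n in both representations; only the
-- in-memory layout of the elements differs.
sumBoxed, sumUnboxed :: Int -> Double
sumBoxed   n = B.foldl' (+) 0 (B.enumFromN 1 n)
sumUnboxed n = U.foldl' (+) 0 (U.enumFromN 1 n)

main :: IO ()
main = mapM_ print [sumBoxed 1000000, sumUnboxed 1000000]
```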

[off] Interesting that optimization slightly changes the result. [/off]

Upvotes: 6

Waldheinz

Reputation: 10487

I'd say the vector version is forced to actually materialize the vector (allocating memory for it) and work with it, much like an implementation using for loops and arrays in an imperative setting. In a sense it "does what one would expect" (for someone with an imperative background, at least).

But something interesting happens in the version using lists, and the magic is fusion (for lists, GHC's foldr/build "short-cut" fusion rather than the vector library's stream fusion): the compiler is smart enough to figure out that keeping a running sum is all it needs to compute the final result. Values are generated and added up one by one, and the sum is printed at the end. The actual list is never needed, so it is never allocated or traversed.

I did not verify this by looking at the generated Core, so...
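The idea can be sketched as the equivalent hand-written accumulator loop (an illustration of what a fused pipeline amounts to, not the actual Core GHC produces):

```haskell
import Data.List (foldl')

-- A strict left fold that carries only the running sum: no list cell
-- survives past the step that consumes it.
sumOscList :: Double -> Int -> Double
sumOscList f n = foldl' (\acc x -> acc + sin (2*pi*x*f/44100)) 0
                        [1 .. fromIntegral n]

main :: IO ()
main = print (sumOscList 55.5 1000000)
```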

Upvotes: 2

Dirk Holsopple

Reputation: 8831

This has to do with laziness. The example that uses lists can take advantage of lazy evaluation, so it effectively iterates through the range of numbers without ever storing any lists in memory. The example with vectors must actually allocate a vector in memory which takes some extra time.

For situations like this where the list never needs to be stored in memory, lists are probably faster. For situations where you actually do need to store the data in memory, vectors will usually be faster.
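For example, once the samples have to live in memory and be accessed out of order, a vector gives O(1) indexing where a list costs O(i) per lookup. A small sketch (the names are mine, chosen for illustration):

```haskell
import qualified Data.Vector.Unboxed as V

-- One second of a 440 Hz sine at 44100 Hz, actually stored in memory.
samples :: V.Vector Double
samples = V.generate 44100 (\i -> sin (2*pi*fromIntegral i*440/44100))

-- Random access is constant-time on a vector; with a list,
-- (!!) would have to walk i cons cells to reach index i.
main :: IO ()
main = print (samples V.! 100, samples V.! 44099)
```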

Upvotes: 0
