Reputation: 2201
I have read about the Vector
library, which uses modern optimization techniques, and tried to compare its performance with lists. The code below generates some sound-like data (which is important in my subject area) and sums the result:
import System.Environment (getArgs)
import System.TimeIt
import Data.List
import Data.Vector.Unboxed as V
x1 :: Int -> [Double]
x1 n = [1..(fromIntegral n)]
x2 :: Int -> V.Vector Double
x2 n = V.enumFromN 1 n
osc1 f = Prelude.map (\x -> sin(2*pi*x*f/44100.0))
osc2 f = V.map (\x -> sin(2*pi*x*f/44100.0))
sum1 = Data.List.foldl1' (+)
sum2 = V.foldl1' (+)
zip1 = Prelude.zipWith (+)
zip2 = V.zipWith (+)
main = do
    s <- getArgs
    let n = read (s !! 0) :: Int
    print "Prelude version"
    timeIt $ print $ sum1 $ zip1 (osc1 55.5 (x1 n)) (osc1 110.0 $ x1 n)
    print "Vector version"
    timeIt $ print $ sum2 $ zip2 (osc2 55.5 (x2 n)) (osc2 110.0 $ x2 n)
GHC 7.6.3 running on Windows 7 with vector-0.10.0.1 and timeit-1.0.0.0 gave me these results:
c:\coding>test 10000000
"Prelude version"
90.98579564908658
CPU time: 9.92s
"Vector version"
90.98579564908658
CPU time: 11.03s
The Vector version is a little slower even though it is Unboxed,
and the boxed Vector version takes 22.67 seconds. Why does this happen? How should I write this code to get maximum performance?
UPD. After adding -O2
(**) I got results that are much clearer to me. It looks like boxed vectors are harder to fuse.
                   List    Vector.Unboxed    Vector
ghc test.hs        9.78    10.94             21.95
ghc test.hs -O2    3.39     1.25              7.57
(**) I didn't notice this at first because GHC doesn't recompile unchanged files even when the command-line flags are different, so I hadn't actually run the -O2 version before. Sorry.
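For reference, the recompilation pitfall described above can be avoided with GHC's -fforce-recomp flag (a real flag; test.hs here stands for the question's source file):

```shell
# GHC skips recompiling an unchanged file even when only the flags
# changed; -fforce-recomp guarantees the new optimization level applies.
ghc -O2 -fforce-recomp test.hs
./test 10000000
```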
Upvotes: 4
Views: 652
Reputation: 7751
When compiling with optimizations enabled, Vector is faster. With optimizations on, the compiler inlines and specializes vector functions, eliminating a lot of function calls and boxed temporary values.
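As a rough illustration of what that specialization means (this sketch is mine, not from the answer; the name `osc` is made up), -O2 produces dictionary-free, monomorphic copies of polymorphic code, which is the kind of transformation a SPECIALIZE pragma requests by hand:

```haskell
-- A polymorphic oscillator like the one in the question.  Compiled
-- without optimization, every call goes through a Floating class
-- dictionary: boxed arguments and indirect calls.
osc :: Floating a => a -> a -> a
osc f x = sin (2 * pi * x * f / 44100)

-- Ask GHC for a dictionary-free copy at Double.  -O2 performs this
-- kind of specialization automatically inside vector's combinators.
{-# SPECIALIZE osc :: Double -> Double -> Double #-}
```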
Switching to streams gives you another 1.5× improvement by fusing all the steps of the computation into a single loop. No arrays are built at all.
import Data.Vector.Fusion.Stream as S
x3 :: Int -> S.Stream Double
x3 n = S.enumFromStepN 1 1 n
osc3 f = S.map (\x -> sin(2*pi*x*f/44100.0))
sum3 = S.foldl1' (+)
zip_3 = S.zipWith (+)
main = do
    s <- getArgs
    let n = read (s !! 0) :: Int
    print "Stream version"
    timeIt $ print $ sum3 $ zip_3 (osc3 55.5 (x3 n)) (osc3 110.0 $ x3 n)
Stream fusion on Vectors won't fuse the input of zipWith, so the vector code doesn't optimize in the same way.
Compiling with -O2, the Prelude version is slowest and the Stream version is fastest.
$ ./Test 10000000
"Prelude version"
90.98579565011536
3.051188s
"Vector version"
90.98579565011924
1.81228s
"Stream version"
90.98579565011907
1.155345s
Upvotes: 2
Reputation: 1310
It's a matter of optimization flags:
-O0
>test 10000000
"Prelude version"
90.98579564908658
CPU time: 6.66s
"Vector version"
90.98579564908658
CPU time: 8.27s
-O1
>test 10000000
"Prelude version"
90.98579565011536
CPU time: 2.70s
"Vector version"
90.98579565011924
CPU time: 1.62s
-O2
>test 10000000
"Prelude version"
90.98579565011536
CPU time: 2.72s
"Vector version"
90.98579565011924
CPU time: 1.34s
From Haskell tag info:
Performance issues
In case of performance issues, please make sure that you compile your code with optimizations enabled. Passing -O2 makes many performance issues go away.
update
For a quick explanation of why Unboxed is faster, here is one:
The most flexible type is Data.Vector.Vector, which provides boxed arrays: arrays of pointers to Haskell values.
These arrays are suitable for storing complex Haskell types (sum types, or algebraic data types), but a better choice for simple data types is Data.Vector.Unboxed.
For Unboxed:
Simple, atomic types, and pair types can be stored in a more efficient manner: consecutive memory slots without pointers.
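The same boxed-versus-unboxed distinction exists in the array package that ships with GHC, which makes for a small self-contained sketch (the names boxed and unboxed are mine, not from the answer):

```haskell
import Data.Array (Array, listArray, (!))
import Data.Array.Unboxed (UArray)
import qualified Data.Array.Unboxed as U

-- Boxed: an array of pointers to (possibly unevaluated) Double thunks,
-- analogous to Data.Vector.Vector.
boxed :: Array Int Double
boxed = listArray (0, 3) [1, 2, 3, 4]

-- Unboxed: four Doubles stored flat in consecutive memory slots with
-- no pointers and no thunks; the same layout Data.Vector.Unboxed uses.
unboxed :: UArray Int Double
unboxed = U.listArray (0, 3) [1, 2, 3, 4]
```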
[off] It's interesting that optimization slightly changes the result. [/off]
Upvotes: 6
Reputation: 10487
I'd say that the vector version is forced to actually materialize the vector (allocating memory for it) and work with it, much like an implementation using for loops and arrays in an imperative setting. In a sense it "does what one would expect" (at least when coming from an imperative background).
But something interesting happens in the version using lists, and this magic is called "stream fusion": the compiler is smart enough to figure out that keeping track of the running sum is sufficient to compute the final result. Values are computed and added up one at a time, and the sum is printed at the end. The actual list is never needed, so it is never allocated or traversed.
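One way to see this without reading the generated Core is to note that the list pipeline runs in constant space: a strict left fold consumes each cons cell as soon as it is produced. A minimal base-only sketch (the name sumOsc is mine, not from the question):

```haskell
import Data.List (foldl')

-- Sum a sine oscillator over n samples.  With a strict left fold the
-- list [1..n] is consumed as it is produced: each cons cell becomes
-- garbage immediately, so no list is ever resident in memory.
sumOsc :: Double -> Int -> Double
sumOsc f n =
  foldl' (+) 0 $ map (\x -> sin (2 * pi * x * f / 44100)) [1 .. fromIntegral n]
```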
I did not verify this by looking at the generated Core, so...
Upvotes: 2
Reputation: 8831
This has to do with laziness. The example that uses lists can take advantage of lazy evaluation, so it effectively iterates through the range of numbers without ever storing any lists in memory. The example with vectors must actually allocate a vector in memory which takes some extra time.
For situations like this where the list never needs to be stored in memory, lists are probably faster. For situations where you actually do need to store the data in memory, vectors will usually be faster.
Upvotes: 0