blauerreimers
blauerreimers

Reputation: 151

Slow stream fusion in Data.Vector.Storable.ZipWith but fast in Data.Vector.Unboxed.ZipWith

Currently, I found some strange behavior in my function. I have a transformation function that transforms from (x,a,b) -> (x,y,z). I'm using Data.Vector.Storable since I'm communicating with some external libs. I estimated the time for the transfer function for a test-dataset something below a dozen seconds. But my implementation is slow as hell! Here, is some kind-of:

import Data.Complex
import System.Random
import qualified Data.Vector as V
import qualified Data.Vector.Storable as VS

nx = 128
ny = 128
nz = 128
na = 128
nb = 16

transform :: VS.Vector (Complex Float) -> VS.Vector (Complex Float)->VS.Vector (Complex Float)->VS.Vector (Complex Float)
transform !inData !transfAArray !transfBArray= VS.concatMap
    (\x ->
       VS.concatMap
         (\y ->
            VS.map
              (\z ->
                 calcSum y z x) $
            VS.enumFromN 0 nz) $
       VS.enumFromN 0 ny ) $
  VS.enumFromN 0 nx
  where
    calcSum :: Int -> Int -> Int -> Complex Float
    calcSum y z x =
      VS.sum $
      VS.map
        (\b ->
           (*) (transfB y b) $ dotProd (transfA z) (inDataSlice x b) ) $
           -- (*) (transfB y b) $ dotProd (transfA z) (transfA z) ) $ Is fast
      VS.enumFromN 0 nb
    dotProd !a1 !a2 = VS.sum $ VS.zipWith (*) a1 a2
    inDataSlice x b = VS.slice (x*na*nb + b*na) na inData
    transfA z =  VS.slice (z * na) na transfAArray
    transfB y b =  VS.unsafeIndex transfBArray (y * nb + b)





randomComplex :: IO (Complex Float)
randomComplex = (:+) <$> randomIO <*> randomIO

main :: IO ()
main = do
  inData <- VS.generateM (nx * na * nb) (\i -> randomComplex) 
  transfAArray <- VS.generateM (nz * na) (\i -> randomComplex)
  transfBArray <- VS.generateM (ny * nb) (\i -> randomComplex)

  let !outData = transform inData transfAArray transfBArray 
  print $ VS.sum outData

It takes minutes to complete this! If I activate line 35 it works very fast - so there must be something with the stream fusion I guess (Profiling shows a lot of >>= in the process). I re-created the code in c++ and got a runtime about 5s (single threaded):

#include <string>
#include <iterator>
#include <iostream>
#include <algorithm>
#include <vector>
#include <complex>
#include <cmath>
#include <numeric>
 
int main()
{
    // transformation [nx*na*nb] -> [nx*ny*nz]
    // [x][y][z], with dimensions 128 128 128
    const unsigned int nx = 128;
    const unsigned int ny = 128;
    const unsigned int nz = 128;
    std::vector<std::complex<float>> out(nx*ny*nz);

    const unsigned int na = 128;
    const unsigned int nb = 16;
    std::vector<std::complex<float>> in(nx * na * nb);
    std::iota (std::begin(in), std::end(in), 0);

    std::vector<std::complex<float>> transA(na * nz);
    std::iota (std::begin(transA), std::end(transA), 0);
    std::vector<std::complex<float>> transB(nb * ny);
    std::iota (std::begin(transB), std::end(transB), 0);

    for (int x = 0; x < nx; x++) {
        for (int y = 0; y < ny; y++) {
            for (int z = 0; z < nz; z++) {
                for (int b = 0; b < nb; b++) {
                    std::complex<float> sum = 0;
                    for (int a = 0; a < na; a++) {
                        sum += transA[z*na + a] * in[x*na*nb+b*na+a];
                    }
                    out[x*ny*nz + y*nz + z]= sum * transB[b * ny + y];
                }
            }
        }
    }

    std::cout << out[12] << "done\n";
}

During my last days of despair I just found the following strange behaviour: If I replace Data.Vector.Storable with Data.Vector.Unboxed, it runs as fast as the c++ code!

Soo.. why?

  1. As far as I did understand, unboxed and storable are both using successive memory regions - so there should be no problem with Cache/Memory-Access times.
  2. In the Storable-version TB for heap was allocated, while the Unboxed-version just uses a few MB. I've the feeling that the Storable version is moving the chunks around in memory constantly.

Any ideas? And is there a simple way to achieve similar speeds with Storage? (I'm using ghc-8.10.4 from Stack)

(PS: Always compiled with -O2 and sometimes also with -fllvm -optlo-O2 :)

Upvotes: 3

Views: 131

Answers (0)

Related Questions