Gabi Purcaru
Gabi Purcaru

Reputation: 31564

Haskell speed / memory usage

I'm trying to process some Point Cloud data with Haskell, and it seems to use a LOT of memory. The code I'm using is below, it basically parses the data into a format I can work with. The dataset has 440MB with 10M rows. When I run it with runhaskell, it uses up all the ram in a short time (~3-4gb) and then crashes. If I compile it with -O2 and run it, it goes to 100% cpu and takes a long time to finish (~3 minutes). I should mention that I'm using an i7 cpu with 4GB ram and an SSD, so there should be plenty of resources. How can I improve the performance of this?

{-# LANGUAGE OverloadedStrings #-}
import Prelude hiding (lines, readFile)
import Data.Text.Lazy (Text, splitOn, unpack, lines)
import Data.Text.Lazy.IO (readFile)
import Data.Maybe (fromJust)
import Text.Read (readMaybe)

filename :: FilePath
filename = "sample.txt"

readTextMaybe = readMaybe . unpack

data Classification = Classification
    { id :: Int, description :: Text
    } deriving (Show)

data Point = Point
    { x :: Int, y :: Int, z :: Int, classification :: Classification
    } deriving (Show)

type PointCloud = [Point]

maybeReadPoint :: Text -> Maybe Point
maybeReadPoint text = parse $ splitOn "," text
    where toMaybePoint :: Maybe Int -> Maybe Int -> Maybe Int -> Maybe Int -> Text -> Maybe Point
          toMaybePoint (Just x) (Just y) (Just z) (Just cid) cdesc = Just (Point x y z (Classification cid cdesc))
          toMaybePoint _ _ _ _ _                                   = Nothing
          parse :: [Text] -> Maybe Point
          parse [x, y, z, cid, cdesc] = toMaybePoint (readTextMaybe x) (readTextMaybe y) (readTextMaybe z) (readTextMaybe cid) cdesc
          parse _                     = Nothing

readPointCloud :: Text -> PointCloud
readPointCloud = map (fromJust . maybeReadPoint) . lines

main = (readFile filename) >>= (putStrLn . show . sum . map x . readPointCloud)

Upvotes: 1

Views: 221

Answers (1)

dfeuer
dfeuer

Reputation: 48611

The reason this uses all your memory when compiled without optimization is most likely because sum is defined using foldl. Without the strictness analysis that comes with optimization, that will blow up badly. You can try using this function instead:

sum' :: Num n => [n] -> n
sum' = foldl' (+) 0

The reason this is slow when compiled with optimization seems likely related to the way you parse the input. A cons will be allocated for each character when reading in the input, and again when breaking the input into lines, and probably yet again when splitting on commas. Using a proper parsing library (any of them) will almost certainly help; using one of the streaming ones like pipes or conduit may or may not be best (I'm not sure).

Another issue, not related to performance: fromJust is rather poor form in general, and is a really bad idea when dealing with user input. You should instead mapM over the list in the Maybe monad, which will produce a Maybe [Point] for you.

Upvotes: 3

Related Questions