Ralli

Reputation: 101

Haskell: Data.Text vs. Data.Text.Lazy Performance

As a training exercise I wrote a short Haskell program as a replacement for a Perl script. The program reads a log file which contains multi-line messages and simply joins them to produce one line per message. My test input file has 14,000 lines and a size of 1 MB. The version using Data.Text has a runtime of 3.5 seconds, the one using Data.Text.Lazy only 0.1 seconds (the original Perl script needs 0.2 seconds). I found in other posts that using Data.Text.Lazy should only pay off for really large amounts of data, so I didn't expect such a difference. Can anybody explain why, or point out what's wrong with my program?

The relevant part of the source (the only difference between the two versions is the Data.Text* imports):

{-# LANGUAGE OverloadedStrings #-}
module Main where
import Data.Char (isDigit)
import Data.Monoid ((<>))
import qualified Data.Text.Lazy    as T
import qualified Data.Text.Lazy.IO as T
import System.Environment (getArgs, getProgName)
import System.IO (openFile, stdin, stdout, Handle,
                  IOMode(ReadMode,WriteMode))

main :: IO ()
main = do
  (inp,out) <- getInput
  actLog <- T.hGetContents inp
  let newLog = processLog actLog
  T.hPutStr out newLog

processLog :: T.Text -> T.Text
processLog = foldr joinLines "" . T.lines

joinLines :: T.Text -> T.Text -> T.Text
joinLines elem accu
  | T.length elem == 0    = accu                     -- Blank Line
  | T.head elem   == ' '  = textElem <> accu         -- Continuation
  | isDigit (T.head elem) = "\n" <> textElem <> accu -- Start
  | otherwise             = accu                     -- Garbage
    where textElem = T.strip elem

Upvotes: 3

Views: 365

Answers (2)

Reid Barton

Reputation: 15009

This looks like a data structures issue rather than a laziness issue. A strict Text is essentially a single big chunk of memory, while a lazy Text is essentially a linked list of strict Texts ("chunks"). The way that a lazy Text is split up into chunks isn't supposed to be part of the meaning of the value (which is just the text obtained by concatenating all the chunks). But it can have a big effect on the cost of operations.

You have a bunch of operations of the form short <> accu where accu is growing with the size of your output. If these are strict Texts, this concatenation has to copy the entire contents of both short and accu into a new strict Text value. The total runtime is necessarily quadratic. If they are lazy Texts, then <> has another option: it can prepend the list of chunks that is short to the list of chunks that is accu. This doesn't need to touch accu at all, but even if it did, the spine of the linked list of chunks that makes up accu could be much less data to process than the entire textual contents of accu, depending on the chunk size. That's why your lazy Text version is so much faster.
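The asymptotic difference can be seen in isolation with a small sketch (my own illustration, not code from the question): both functions below build an n-line result by repeated prepending, just as processLog does, differing only in which Text type they use.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Text      as TS
import qualified Data.Text.Lazy as TL

-- O(n^2) overall: every (<>) copies the whole accumulator into a
-- fresh contiguous buffer before prepending the short piece.
buildStrict :: Int -> TS.Text
buildStrict n = foldr (\i acc -> TS.pack (show i) <> "\n" <> acc) "" [1..n]

-- O(n) overall: (<>) on lazy Text just conses new chunks onto the
-- accumulator's chunk list without touching its contents.
buildLazy :: Int -> TL.Text
buildLazy n = foldr (\i acc -> TL.pack (show i) <> "\n" <> acc) "" [1..n]
```

Both produce the same text; only the cost of building it differs, which matches the 3.5 s vs. 0.1 s measurement in the question.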

It looks like you could write your program in the form

processLog = T.unlines . processLines . T.lines
processLines :: [T.Text] -> [T.Text]
processLines = ...

which leaves the problem of how to concatenate up to the library function T.unlines, in which case you can expect it to be efficient whether you use strict Text or lazy Text.
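One possible processLines along these lines (my own sketch, assuming the question's message format: start lines begin with a digit, continuations with a space; the helper names and the choice to simply drop orphan continuation lines are mine, and T.unlines puts the newline after each message rather than before it as the original did):

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Char (isDigit)
import qualified Data.Text as T

processLog :: T.Text -> T.Text
processLog = T.unlines . processLines . T.lines

processLines :: [T.Text] -> [T.Text]
processLines = go . filter relevant
  where
    -- keep only start and continuation lines; drop blanks and garbage
    relevant l = not (T.null l) &&
                 (isDigit (T.head l) || T.head l == ' ')
    go [] = []
    go (l:ls)
      | isDigit (T.head l) =      -- start line: collect its continuations
          let (conts, rest) = span isCont ls
          in T.concat (map T.strip (l : conts)) : go rest
      | otherwise = go ls         -- orphan continuation with no start
    isCont l = T.head l == ' '
```

Each message is assembled by a single T.concat, and T.unlines concatenates the messages once, so no accumulator is copied repeatedly.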

Upvotes: 1

prof1990

Reputation: 480

The difference between the lazy and strict Data.Text is whether the entire file is read into memory at

actLog <- T.hGetContents inp 

When processed lazily, Haskell reads only the lines directly required to produce the output. So instead of reading the entire file and then processing and writing it, the program can read, process and write as needed, eliminating the wait for the entire file to be read into memory.
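This streaming behaviour can be sketched minimally as follows (my own example, with toUpper standing in for an arbitrary transformation): with the lazy variant nothing forces the whole input into memory before output begins.

```haskell
import qualified Data.Text.Lazy    as TL
import qualified Data.Text.Lazy.IO as TL
import System.IO (stdin, stdout)

-- Lazy hGetContents returns immediately; chunks are read from stdin
-- only as hPutStr demands them, so the program runs in roughly
-- constant space regardless of input size. The strict Data.Text.IO
-- version of hGetContents instead reads the whole handle up front.
main :: IO ()
main = do
  contents <- TL.hGetContents stdin
  TL.hPutStr stdout (TL.toUpper contents)
```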

Upvotes: 0
