Reputation: 101
As an exercise I wrote a short Haskell program as a replacement for a Perl script. The program reads a log file that contains multi-line messages and simply joins them to produce one line per message. My test input file has 14,000 lines and a size of 1 MB. The version using Data.Text has a runtime of 3.5 seconds, the one using Data.Text.Lazy only 0.1 seconds (the original Perl script needs 0.2 seconds). I found in other posts that using Data.Text.Lazy would only make sense for really large amounts of data and didn't expect such a difference. Can anybody explain the reason why, or what's wrong with my program?
The relevant part of the source (the only difference between the two versions is the Data.Text* import):
{-# LANGUAGE OverloadedStrings #-}

module Main where

import           Data.Char          (isDigit)
import           Data.Monoid        ((<>))
import qualified Data.Text.Lazy     as T
import qualified Data.Text.Lazy.IO  as T
import           System.Environment (getArgs, getProgName)
import           System.IO          (openFile, stdin, stdout, Handle,
                                     IOMode (ReadMode, WriteMode))

main :: IO ()
main = do
  (inp, out) <- getInput
  actLog <- T.hGetContents inp
  let newLog = processLog actLog
  T.hPutStr out newLog

processLog :: T.Text -> T.Text
processLog = foldr joinLines "" . T.lines

joinLines :: T.Text -> T.Text -> T.Text
joinLines elem accu
  | T.length elem == 0    = accu                      -- blank line
  | T.head elem == ' '    = textElem <> accu          -- continuation
  | isDigit (T.head elem) = "\n" <> textElem <> accu  -- start of message
  | otherwise             = accu                      -- garbage
  where textElem = T.strip elem
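getInput is omitted from the excerpt above. Purely to make the snippet self-contained, here is one plausible stand-in inferred from the imports (optional input/output file arguments, defaulting to stdin/stdout); the actual definition may differ:

getInput :: IO (Handle, Handle)
getInput = do
  args <- getArgs
  case args of
    []          -> return (stdin, stdout)            -- act as a filter
    [i]         -> do inp <- openFile i ReadMode
                      return (inp, stdout)
    (i : o : _) -> do inp <- openFile i ReadMode
                      out <- openFile o WriteMode
                      return (inp, out)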
Upvotes: 3
Views: 365
Reputation: 15009
This looks like a data structures issue rather than a laziness issue. A strict Text is essentially a single big chunk of memory, while a lazy Text is essentially a linked list of strict Texts ("chunks"). The way that a lazy Text is split up into chunks isn't supposed to be part of the meaning of the value (which is just the text obtained by concatenating all the chunks), but it can have a big effect on the cost of operations.
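A quick illustration of the chunk structure (not part of the original answer), using toChunks/fromChunks from the text package; the chunks themselves are strict Data.Text values:

{-# LANGUAGE OverloadedStrings #-}

import qualified Data.Text.Lazy as TL

-- One lazy Text deliberately built from three strict chunks; its meaning
-- is just the concatenation, but the chunking is observable.
sample :: TL.Text
sample = TL.fromChunks ["multi-", "line ", "message"]

main :: IO ()
main = do
  print (TL.toChunks sample)   -- ["multi-","line ","message"]
  print (TL.toStrict sample)   -- "multi-line message" as one flat chunk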
You have a bunch of operations of the form short <> accu, where accu grows with the size of your output. If these are strict Texts, this concatenation has to copy the entire contents of both short and accu into a new strict Text value, so the total runtime is necessarily quadratic. If they are lazy Texts, then <> has another option: it can prepend the list of chunks that is short to the list of chunks that is accu. This doesn't need to touch accu at all, but even if it did, the spine of the linked list of chunks that makes up accu could be much less data to process than the entire textual contents of accu, depending on the chunk size. That's why your lazy Text version is so much faster.
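A minimal sketch of that access pattern in isolation (mine, not the answerer's): building a value by repeatedly prepending to a growing accumulator copies the whole accumulator at every step for the strict type, but only prepends a small chunk for the lazy one.

{-# LANGUAGE OverloadedStrings #-}

import qualified Data.Text      as TS
import qualified Data.Text.Lazy as TL

-- Strict: every (<>) copies the entire accumulator built so far (quadratic).
buildStrict :: Int -> TS.Text
buildStrict n = foldr (\_ accu -> "line\n" <> accu) TS.empty [1 .. n]

-- Lazy: (<>) just puts a small chunk in front of the chunk list.
buildLazy :: Int -> TL.Text
buildLazy n = foldr (\_ accu -> "line\n" <> accu) TL.empty [1 .. n]

main :: IO ()
main = do
  print (TS.length (buildStrict 10000))  -- noticeably slower to build
  print (TL.length (buildLazy  10000))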
It looks like you could write your program in the form

processLog = T.unlines . processLines . T.lines

processLines :: [T.Text] -> [T.Text]
processLines = ...

which leaves the problem of how to concatenate up to the library function T.unlines, in which case you can expect it to be efficient whether you use strict Text or lazy Text.
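One possible way to fill in processLines, sketched under the simplifying assumption that a message is a digit-starting line followed immediately by its space-indented continuation lines (this is a guess at the missing body, not code from the question or the answer):

processLines :: [T.Text] -> [T.Text]
processLines [] = []
processLines (l:ls)
  | isStart l =
      -- take the continuation lines that directly follow the start line
      let (conts, rest) = span isContinuation ls
      in T.concat (map T.strip (l : conts)) : processLines rest
  | otherwise = processLines ls              -- blank lines and garbage
  where
    isStart        t = not (T.null t) && isDigit (T.head t)
    isContinuation t = not (T.null t) && T.head t == ' '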
Upvotes: 1
Reputation: 480
The difference between the lazy and strict Data.Text is whether the entire file is read into memory at

actLog <- T.hGetContents inp

When processed lazily, Haskell reads only as much of the input as is directly required to produce the output. So instead of reading the entire file and then processing and writing, it can read, process, and write as needed, eliminating the wait for the entire file to be read into memory.
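As a rough illustration (my sketch, with a made-up file name), lazy hGetContents lets reading, transforming, and writing interleave instead of loading everything up front:

import qualified Data.Text.Lazy    as TL
import qualified Data.Text.Lazy.IO as TL
import System.IO (withFile, IOMode (ReadMode))

main :: IO ()
main = withFile "big.log" ReadMode $ \h -> do
  contents <- TL.hGetContents h       -- read lazily, chunk by chunk
  TL.putStr (TL.toUpper contents)     -- forces the input as it prints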
Upvotes: 0