Charles Durham

Reputation: 1707

Read large lines in huge file without buffering

I was wondering if there's an easy way to get lines one at a time out of a file without eventually loading the whole file into memory. I'd like to do a fold over the lines with an attoparsec parser. I tried using Data.Text.Lazy.IO with hGetLine, and that blows through my memory; I read later that it eventually loads the whole file.

I also tried using pipes-text with folds and view lines:

s <- Pipes.sum $ 
    folds (\i _ -> (i+1)) 0 id (view Text.lines (Text.fromHandle handle))
print s

to just count the number of lines. It seems to be doing some wonky stuff ("hGetChunk: invalid argument (invalid byte sequence)"), and it takes 11 minutes where wc -l takes 1 minute. I heard that pipes-text might have some issues with gigantic lines? (Each line is about 1 GB.)

I'm really open to any suggestions; I can't find much by searching except for newbie readLine how-tos.

Thanks!

Upvotes: 6

Views: 382

Answers (2)

Michael

Reputation: 2909

This is probably easiest as a fold over the decoded text stream:

{-# LANGUAGE BangPatterns #-}
import Pipes 
import qualified Pipes.Prelude as P
import qualified Pipes.ByteString as PB
import qualified Pipes.Text.Encoding as PT
import qualified Control.Foldl as L
import qualified Control.Foldl.Text as LT
main = do
  n <- L.purely P.fold (LT.count '\n') $ void $ PT.decodeUtf8 PB.stdin
  print n

It takes about 14% longer than wc -l for the file I produced, which was just long lines of commas and digits. As the documentation says, IO should properly be done with Pipes.ByteString; the rest is conveniences of various sorts.

You can map an attoparsec parser over each line, as delimited by view lines, but keep in mind that an attoparsec parser can accumulate the whole text as it pleases, and that might not be a great idea over a 1-gigabyte chunk of text. If there is a repeated figure on each line (e.g. space-separated numbers), you can use Pipes.Attoparsec.parsed to stream them.
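A minimal sketch of that last point, assuming the input is whitespace-separated decimal integers (the parser name `number` and the sum are just illustrative): Pipes.Attoparsec.parsed repeatedly applies the parser to the byte stream, yielding each result as it is produced, so nothing resembling a whole line is ever held in memory.

```haskell
import Pipes
import qualified Pipes.Prelude as P
import qualified Pipes.ByteString as PB
import qualified Pipes.Attoparsec as PA
import Data.Attoparsec.ByteString.Char8 (Parser, decimal, skipSpace)

-- one item of the repeated figure: skip whitespace (including newlines),
-- then read one decimal integer
number :: Parser Int
number = skipSpace *> decimal

main :: IO ()
main = do
  -- stream every number on stdin and sum them as they arrive;
  -- `void` discards the final return value (the parse stops with a
  -- leftover error at end-of-input, which we ignore here for brevity)
  total <- P.fold (+) 0 id (void (PA.parsed number PB.stdin))
  print total
```

The same shape works for any per-item parser, as long as each item is small even when the lines are not.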

Upvotes: 3

Michael Snoyman

Reputation: 31315

The following code uses Conduit, and will:

  • UTF8-decode standard input
  • Run the lineC combinator as long as there is more data available
  • For each line, simply yield the value 1 and discard the line content, without ever reading the entire line into memory at once
  • Sum up the 1s yielded and print it

You can replace the yield 1 code with something which will do processing on the individual lines.

#!/usr/bin/env stack
-- stack --resolver lts-8.4 --install-ghc runghc --package conduit-combinators
import Conduit

main :: IO ()
main = (runConduit
     $ stdinC
    .| decodeUtf8C
    .| peekForeverE (lineC (yield (1 :: Int)))
    .| sumC) >>= print
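As a sketch of that replacement, here the `yield (1 :: Int)` is swapped for an inner conduit that streams each line's character count instead (still chunk by chunk, never materializing a line), and the counts are summed:

```haskell
#!/usr/bin/env stack
-- stack --resolver lts-8.4 --install-ghc runghc --package conduit-combinators
import Conduit

main :: IO ()
main = (runConduit
     $ stdinC
    .| decodeUtf8C
       -- lengthCE folds over the line's chunks as they stream past,
       -- so only the running count is kept in memory
    .| peekForeverE (lineC (do n <- lengthCE; yield (n :: Int)))
    .| sumC) >>= print
```

Any streaming consumer can go inside lineC; it sees exactly one line's worth of input before lineC moves on to the next.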

Upvotes: 7
