vzex
vzex

Reputation: 437

how to process large file without stack overflow error in haskell?

here is my code:(get file line num and word count)

import System.IO
import Data.Maybe
readL::(Int,Int,Int)->IO()
readL (w,l,-1) = do
                putStrLn $ "word:" ++(show w )++"\nline:"++(show l)
readL (w,l,0) = do 
                s<-hIsEOF stdin
                if s 
                        then readL (w,l,-1)
                        else 
                                do
                                f<-getLine
                                readL (w+length f,l+1,0)

main = do
        hSetBinaryMode stdin True
        readL (0,0,0)

when I process a file with size 100m,it just crashes,with error: Stack space overflow: current size 8388608 bytes

Is there something I wrote wrong?

I also have another version here:

import System.IO
import Data.List
main = do
        hSetBinaryMode stdin True
        interact $ (\(w,l)->"line:"++(show l)++"\nwords:"++(show w)++"\n"). foldl' (\(w,l) r-> (w + length r,l+1) ) (0,0)   .lines

this have the same problem too... and with lots of memory,so,anybody can slove this?I'm just a new learner in haskell.

Upvotes: 0

Views: 212

Answers (1)

Daniel Fischer
Daniel Fischer

Reputation: 183858

The problem is that neither the w nor the l parameter to readL are evaluated before the end of input is reached. So for an input with many lines, you build huge thunks (((0 + length line1) + length line2) ... + length lastline), similar for l, and for more than half a million lines or so, evaluating that thunk will not fit in the available stack. Additionally, the length f holds on to the line read until it is evaluated, causing unnecessarily large memory use.

You have to keep the accumulating parameters evaluated in the loop, the easiest way is with bang-patterns

readL !(!w,!l,-1) = ...

or a seq:

readL (w,l,c)
    | w `seq` l `seq` (c == -1) = putStrLn $ "word:" ++(show w )++"\nline:"++(show l)
readL (w,l,0) = do 
                s<-hIsEOF stdin
                if s 
                        then readL (w,l,-1)
                        else 
                                do
                                f<-getLine
                                readL (w+length f,l+1,0)

The foldl' version has the same problem,

foldl' (\(w,l) r-> (w + length r,l+1) ) (0,0)

only evaluates the accumulator pair to weak head normal form, that is to the outermost constructor, here (,). It does not force evaluation of the components. To do that, you can

  • use a strict pair type for the fold

    data P = P !Int !Int
    
    foo = ... . foldl' (\(P w l) r -> P (w + length r) (l+1)) (P 0 0) . lines
    
  • or use seq in the folded function

    ... . foldl' (\(w,l) r -> w `seq` l `seq` (w + length r, l+1)) . lines
    

Upvotes: 2

Related Questions