jassinm

Reputation: 7491

How to avoid Out of Memory when parsing a file into a Stream in Scala

I am new to Scala and would like to understand why the following code results in a "GC overhead limit exceeded" error and what should be done to avoid it.

import scala.io.Source
import scala.annotation.tailrec

  def getItems(file: Source): Stream[String] = {
    @tailrec
    def acc(it: Iterator[String],
            item: String,
            items: Stream[String]): Stream[String] = {

      if(it.hasNext){
        val line = it.next
        line take 1 match {
          case " " =>
            acc(it, item + "\n" + line, items)
          case "1" =>
            acc(it, item, Stream.cons(item, items))
        }
      }
      else {
        Stream.cons(item, items)
      }
    }
    acc(file.getLines(), "", Stream.Empty)
  }

Upvotes: 0

Views: 249

Answers (3)

simpadjo

Reputation: 4017

Stream in Scala is actually a leaky abstraction. It pretends to be a Seq, but you can't use it as a regular collection when the stream is huge. Here is an article about streams: http://blog.dmitryleskov.com/programming/scala/stream-hygiene-i-avoiding-memory-leaks/ In your case the rule 'don't store Streams in method arguments' is violated (by the items parameter).

Upvotes: 0

chengpohi

Reputation: 14217

There are two reasons your code may cause an OOM:

  1. item grows with every recursive call because it is never reset, so it can become very large depending on the size of your file.
  2. The accumulated item is repeatedly added to the Stream, so the Stream can also become very large and cause the OOM.

One way to handle this scenario may be to use lazy evaluation and a Stream without memoization.
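
As a rough sketch of that idea, a plain Iterator can play the role of a "Stream without memoization": it yields one item at a time and keeps nothing behind it. The helper name groupItems is hypothetical, and the grouping rule is an assumption about the intended format: a line starting with a space continues the current item, and any other line (such as one starting with "1") begins a new one.

```scala
// Hypothetical helper, not from the question: lazily groups lines into items.
// Assumption: a line starting with a space continues the current item;
// any other line begins a new item.
def groupItems(lines: Iterator[String]): Iterator[String] = new Iterator[String] {
  // A buffered iterator lets us peek at the next line without consuming it.
  private val in = lines.buffered

  def hasNext: Boolean = in.hasNext

  def next(): String = {
    // Start the item with its first line, then absorb continuation lines.
    val item = new StringBuilder(in.next())
    while (in.hasNext && in.head.startsWith(" ")) {
      item.append("\n").append(in.next())
    }
    item.toString
  }
}

val grouped = groupItems(Iterator("1 a", " b", "1 c")).toList
```

With the question's input this might be consumed as groupItems(file.getLines()).foreach(println), so only the item currently being built is held in memory.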

Upvotes: 1

Joerg Schmuecker

Reputation: 111

I am trying to figure out what you are actually trying to do, but the problem is that you are recursing with your acc function until your input file has no more elements. Here is a very simple example that converts your iterator into a stream.

def convert[T]( iter : Iterator[T] ) : Stream[T] = 
  if ( iter.hasNext ) {
    Stream.cons( iter.next, convert( iter ) )
  } else {
    Stream.empty
  }
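
To illustrate why this shape behaves differently from acc, here is a small check, assuming the standard library's Stream.cons, whose tail parameter is passed by name. The recursive call is therefore deferred until the tail is requested, which is why the function can be applied even to an infinite iterator. The names source and firstThree are made up for the example. (In Scala 2.13 and later, Stream is deprecated in favour of LazyList, so this compiles with a deprecation warning there.)

```scala
// Same shape as convert above, repeated so the snippet is self-contained.
def convert[T](iter: Iterator[T]): Stream[T] =
  if (iter.hasNext) Stream.cons(iter.next(), convert(iter)) // tail is evaluated lazily
  else Stream.empty

// An infinite iterator: an eager conversion would never finish.
val source = Iterator.from(1)

// Only as many elements as requested are pulled from the iterator.
val firstThree = convert(source).take(3).toList
```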

In addition, you are appending all lines that start with a space to item. I don't know how many such lines your input has, but if every line started with a space, you would use roughly (n^2)/2 characters for an input file of n characters. But I don't think that is why your recursion fails.
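
A simplified way to see where the (n^2)/2 estimate comes from: each + creates a new string containing everything appended so far, so the lengths of the intermediate strings add up to 1 + 2 + ... + n. The sketch below ignores the "\n" separators and uses single-character pieces; the name intermediateChars is hypothetical.

```scala
// Sums the lengths of every intermediate string produced when the pieces
// are appended one at a time with `+` (simplified: no "\n" separators).
def intermediateChars(pieces: Seq[String]): Long =
  pieces.scanLeft("")(_ + _).tail.map(_.length.toLong).sum

val n = 1000
// Lengths are 1, 2, ..., n, so the total is n * (n + 1) / 2.
val total = intermediateChars(Seq.fill(n)("x"))
```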

Upvotes: 0
