How to use an iterator in Scala to make a decision based on length, memory-efficiently

Question

I have a very long collection that I need to iterate over in Scala, and I want to avoid keeping it all in memory. The solution I came up with is this:

(rows is the iterator I am trying to process, and COMPONENT_LIMIT is my estimate of how many objects I can keep in memory.)

val ( processItr, countItr ) = rows.duplicate
val pastLimitItr = countItr.drop( COMPONENT_LIMIT )
if ( pastLimitItr.hasNext ) 
  new CustomIterator( processItr.buffered) 
else
  Iterator( MappperToObject.createObject(
            processItr.toList
          ) )

The problem is this: even though I no longer need pastLimitItr after the test, as far as I can tell from the Scala source of def duplicate, the queue will hang around, so the memory used will be proportional to the length of the iterator.

The question is: how can I get rid of the queue in the Partner object in def duplicate after I am done with the test? I do not need the duplicate at all after the test.

UPDATE: I should have added that the output iterator objects will contain some of the objects in the input iterator based on their content, so I cannot use grouped as suggested.

UPDATE: It looks like span is the right answer out of the options given in the answer. I was probably not specific enough in my question.

som-snytt · Accepted Answer

It sounds like you want to use:

val segments = iterator.grouped(LIMIT)
createObject(segments.next())

Though if you did need duplicate, you could drain the duplicates.

You can also use iterator.span with a condition that counts:

scala> val it = (1 to 10).iterator
it: Iterator[Int] = non-empty iterator

scala> var n = 0 ; val (vs, rest) = it.span { _ => n += 1; n < 3 }
n: Int = 0
vs: Iterator[Int] = non-empty iterator
rest: Iterator[Int] = unknown-if-empty iterator

scala> vs.toList
res0: List[Int] = List(1, 2)

scala> rest.toList
res1: List[Int] = List(3, 4, 5, 6, 7, 8, 9, 10)

You could define that as Iterator::splitAt:

scala> implicit class splitItAt[A](it: Iterator[A]) {
     | def splitAt(i: Int): (Iterator[A], Iterator[A]) = {
     |   var n = 0
     |   it.span { _ => n += 1; n <= i }
     | }}
defined class splitItAt

scala> val (is, rest) = (1 to 10).iterator.splitAt(6)
is: Iterator[Int] = non-empty iterator
rest: Iterator[Int] = unknown-if-empty iterator

scala> is.toList
res2: List[Int] = List(1, 2, 3, 4, 5, 6)

But I see you actually want to use either the prefix or the remaining iterator.

I'd write a custom method. Or, don't laugh:

scala> val (is, rest) = (1 to 10).iterator.splitAt(6)
is: Iterator[Int] = non-empty iterator
rest: Iterator[Int] = unknown-if-empty iterator

scala> is match { case it: collection.Iterator$Leading$1 if rest.hasNext => it.finish() ; rest ; case _ => is }
res6: Iterator[Int] = unknown-if-empty iterator

scala> res6.next
res7: Int = 7

That internal finish means you can use the rest without buffering the prefix.

And you can also cheat grouped, as implemented, and use the original iterator for rest:

scala> val it = (1 to 10).iterator
it: Iterator[Int] = non-empty iterator

scala> val g = it.grouped(3)
g: it.GroupedIterator[Int] = non-empty iterator

scala> val first = g.next
first: List[Int] = List(1, 2, 3)

scala> it.hasNext
res12: Boolean = true

scala> it.next
res13: Int = 4
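
Reading from the original iterator after calling grouped depends on how grouped happens to be implemented, and the Iterator documentation generally advises against reusing an iterator after deriving another one from it. As a rough sketch of a variant that only touches the iterator returned by grouped (the names GroupedDecision and smallOrStreaming, and the Either result, are hypothetical stand-ins for the question's two branches):

```scala
object GroupedDecision {
  // Hypothetical helper, assuming limit >= 1 (grouped rejects smaller sizes).
  // Left  = the whole input fit into a single group of `limit` elements.
  // Right = the input is longer and is handed back as a lazy iterator.
  def smallOrStreaming[A](rows: Iterator[A], limit: Int): Either[List[A], Iterator[A]] = {
    val groups = rows.grouped(limit)
    if (!groups.hasNext) Left(Nil)
    else {
      val first = groups.next().toList
      // Asking hasNext lets grouped read ahead, so up to one more group
      // may be buffered at this point.
      if (groups.hasNext) {
        val rest: Iterator[A] = groups.flatMap(group => group)
        Right(first.iterator ++ rest)
      } else Left(first)
    }
  }
}
```

The trade-off is that this may hold up to roughly two groups in memory at the decision point instead of one, in exchange for not depending on the position of the original iterator.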

The custom method with no internals to hold onto:

scala> :pa
// Entering paste mode (ctrl-D to finish)

import scala.collection.mutable

implicit class splitItAt[A](private val it: Iterator[A]) extends AnyVal {
  def splitAt(i: Int): (List[A], Iterator[A]) = {
    val buf = mutable.ListBuffer.empty[A]
    var n = 0
    while (it.hasNext && n < i) {
      buf += it.next()
      n += 1
    }
    (buf.toList, it)
  }
}

// Exiting paste mode, now interpreting.

defined class splitItAt

scala> val (is, rest) = (1 to 10).iterator.splitAt(20)
is: List[Int] = List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
rest: Iterator[Int] = empty iterator

scala> val (is, rest) = (1 to 10).iterator.splitAt(6)
is: List[Int] = List(1, 2, 3, 4, 5, 6)
rest: Iterator[Int] = non-empty iterator

scala> val (is, rest) = (1 to 10).iterator.splitAt(0)
is: List[Int] = List()
rest: Iterator[Int] = non-empty iterator
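
To tie this back to the question, here is a minimal sketch of how the eager split might drive the decision. The helper is written as a plain function rather than an implicit class so the block stands alone; SplitDecision, smallOrStreaming and the Either result are hypothetical stand-ins for the question's createObject and CustomIterator branches, not anything from the original code.

```scala
import scala.collection.mutable

object SplitDecision {
  // Same idea as the splitAt above: eagerly read at most `limit` elements
  // and return them together with the (advanced) original iterator.
  def splitAt[A](it: Iterator[A], limit: Int): (List[A], Iterator[A]) = {
    val buf = mutable.ListBuffer.empty[A]
    var n = 0
    while (n < limit && it.hasNext) {
      buf += it.next()
      n += 1
    }
    (buf.toList, it)
  }

  // Left  = the input had at most `limit` elements, fully materialised.
  // Right = the input was longer; the prefix is replayed in front of the
  //         untouched remainder, so nothing beyond the prefix is buffered.
  def smallOrStreaming[A](rows: Iterator[A], limit: Int): Either[List[A], Iterator[A]] = {
    val (prefix, rest) = splitAt(rows, limit)
    if (rest.hasNext) Right(prefix.iterator ++ rest)
    else Left(prefix)
  }
}
```

In the question's terms, the Left case would presumably feed createObject and the Right case would be wrapped in the custom iterator. Only the first `limit` elements are ever held in a list, and no duplicate queue is involved.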
