Iterator[Something] to Iterator[Seq[Something]]

Question

I need to process a "big" file (something that does not fit in memory).

I want to process the data in batches. For the sake of example, say I want to insert the records into a database. The file is too big to fit in memory, and processing the elements one by one is too slow.

So I'd like to go from an Iterator[Something] to an Iterator[Iterable[Something]] to batch elements.

Starting with this:

CSVReader.open(new File("big_file"))
  .iteratorWithHeaders
  .map(Something.parse)
  .foreach(Jdbi.insertSomething)

I could do something dirty in the foreach statement with mutable sequences and flushes every x elements but I'm sure there is a smarter way to do this...

// Yuk... :-(
val buffer = ArrayBuffer[Something]()
CSVReader.open(new File("big_file"))
  .iteratorWithHeaders
  .map(Something.parse)
  .foreach {
     something =>
       buffer.append(something)
       if (buffer.size == 1000) {
         Jdbi.insertSomethings(buffer.toList)
         buffer.clear()
       }
   }
Jdbi.insertSomethings(buffer.toList)

Travis Brown · Accepted Answer

If your batches can have a fixed size (as in your example), the grouped method on Scala's Iterator does exactly what you want:

val iterator = Iterator.continually(1)

iterator.grouped(10000).foreach(xs => println(xs.size))

This will run in a constant amount of memory (not counting whatever text is stored in memory by your terminal, of course).
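Applied to the pipeline in the question, grouped replaces the manual buffer entirely. The following is only a sketch: Something and insertSomethings are hypothetical stand-ins for the question's own types and its Jdbi call, and Iterator.range stands in for the parsed CSV rows.

```scala
// Hypothetical stand-in for the question's record type.
final case class Something(id: Int)

object BatchDemo {
  // Records the size of each batch instead of writing to a database.
  val batchSizes = scala.collection.mutable.ListBuffer.empty[Int]

  // Hypothetical stand-in for Jdbi.insertSomethings.
  def insertSomethings(batch: Seq[Something]): Unit =
    batchSizes += batch.size

  def run(): Unit = {
    // Stand-in for the lazily parsed CSV rows.
    val rows: Iterator[Something] = Iterator.range(0, 2500).map(Something(_))

    // grouped yields full batches of 1000, then one final partial batch,
    // so no separate flush is needed after the loop.
    rows.grouped(1000).foreach(insertSomethings)
  }
}
```

Only one batch is materialized at a time, and the trailing partial batch is emitted automatically.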

I'm not sure what your iteratorWithHeaders returns, but if it's a Java iterator, you can convert it to a Scala one like this:

// On Scala 2.13+, prefer scala.jdk.CollectionConverters._ (JavaConverters is deprecated there).
import scala.collection.JavaConverters._

val myScalaIterator: Iterator[Int] = myJavaIterator.asScala

This will remain appropriately lazy.
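The conversion and the batching compose directly. Here is a small sketch, assuming a java.util.Iterator of strings built from Arrays.asList purely for illustration (it is not what iteratorWithHeaders returns):

```scala
import scala.collection.JavaConverters._

object JavaIteratorDemo {
  // Illustrative Java iterator; in practice this would come from a Java API.
  def javaIterator: java.util.Iterator[String] =
    java.util.Arrays.asList("a", "b", "c", "d", "e").iterator()

  // asScala wraps the Java iterator without copying it;
  // grouped then pulls elements two at a time as batches are requested.
  def batches: List[List[String]] =
    javaIterator.asScala.grouped(2).map(_.toList).toList
}
```

The final toList is there only to make the result easy to inspect; on a real large input you would consume the batches with foreach instead.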
