Reputation: 1636
I need to process a "big" file (something that does not fit in memory).
I want to batch-process the data. For the sake of example, say I want to insert the records into a database. Because the file is so large, processing the elements one by one is also too slow.
So I'd like to go from an Iterator[Something] to an Iterator[Iterable[Something]] in order to batch the elements.
Starting with this:
CSVReader.open(new File("big_file"))
  .iteratorWithHeaders
  .map(Something.parse)
  .foreach(Jdbi.insertSomething)
I could do something dirty in the foreach statement, with a mutable sequence that is flushed every x elements, but I'm sure there is a smarter way to do this...
// Yuk... :-(
import scala.collection.mutable.ArrayBuffer

val buffer = ArrayBuffer[Something]()
CSVReader.open(new File("big_file"))
  .iteratorWithHeaders
  .map(Something.parse)
  .foreach { something =>
    buffer.append(something)
    // Flush a full batch of 1000 elements
    if (buffer.size == 1000) {
      Jdbi.insertSomethings(buffer.toList)
      buffer.clear()
    }
  }
// Flush whatever is left over, if anything
if (buffer.nonEmpty) Jdbi.insertSomethings(buffer.toList)
Upvotes: 2
Views: 395
Reputation: 2130
If I understood your problem correctly, you can just use Iterator.grouped. Adapting your example a little:
val si: Iterator[Something] = CSVReader.open(new File("big_file"))
  .iteratorWithHeaders
  .map(Something.parse)
// grouped yields batches as Seq[Something]; the last batch may be shorter
val gsi: Iterator[Seq[Something]] = si.grouped(1000)
gsi.foreach { batch: Seq[Something] =>
  Jdbi.insertSomethings(batch.toList)
}
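To make the batching behaviour concrete, here is a minimal self-contained sketch. The case class Something, the BatchDemo object, and its insertSomethings method are hypothetical stand-ins for the question's types and for Jdbi.insertSomethings; the stand-in only records batch sizes instead of writing to a database.

```scala
// Hypothetical stand-in for the question's Something type.
final case class Something(id: Int)

object BatchDemo {
  // Records the size of each batch instead of talking to a database.
  val batchSizes = scala.collection.mutable.ListBuffer.empty[Int]

  // Hypothetical stand-in for Jdbi.insertSomethings.
  def insertSomethings(batch: List[Something]): Unit =
    batchSizes += batch.size

  def run(): Unit = {
    batchSizes.clear()
    // 2500 elements stand in for the parsed rows of the big file.
    val rows: Iterator[Something] = Iterator.range(0, 2500).map(Something(_))
    // grouped(1000) yields full batches first, then one shorter final batch.
    rows.grouped(1000).foreach(batch => insertSomethings(batch.toList))
  }
}
```

After BatchDemo.run(), batchSizes holds 1000, 1000 and 500, so no separate flush is needed for the leftover elements.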
Upvotes: 2
Reputation: 139028
If your batches can have a fixed size (as in your example), the grouped method on Scala's Iterator does exactly what you want:
val iterator = Iterator.continually(1)
iterator.grouped(10000).foreach(xs => println(xs.size))
This will run in a constant amount of memory (not counting whatever text is stored in memory by your terminal, of course). Note that the iterator here is infinite, so the loop keeps printing until you interrupt it.
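As a rough illustration of why memory stays bounded, the following sketch counts how many elements have been pulled from the source. The LazyGroupedDemo object and its field names are made up for this example; the point is only that calling grouped pulls nothing by itself, and that asking for one batch pulls elements for that batch rather than draining the source.

```scala
object LazyGroupedDemo {
  // Counts how many elements the source has produced so far.
  var pulled = 0

  // An infinite source, like Iterator.continually(1) above, but counting.
  val source: Iterator[Int] = Iterator.continually { pulled += 1; pulled }

  // Creating the grouped iterator does not consume the source.
  val batches: Iterator[Seq[Int]] = source.grouped(10000)
  val pulledBeforeFirstBatch: Int = pulled

  // Only now are elements pulled, to fill a single batch.
  val firstBatch: Seq[Int] = batches.next()
  val pulledAfterFirstBatch: Int = pulled
}
```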
I'm not sure what your iteratorWithHeaders returns, but if it's a Java iterator, you can convert it to a Scala one like this:
import scala.collection.JavaConverters._
val myScalaIterator: Iterator[Int] = myJavaIterator.asScala
This will remain appropriately lazy: elements are pulled from the Java iterator only as the Scala iterator is consumed.
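Here is a small sketch of the conversion combined with grouped. The JavaIteratorDemo object and the seven-element list are invented for illustration. On Scala 2.13 and later, scala.jdk.CollectionConverters._ is the non-deprecated replacement for JavaConverters and provides the same asScala call.

```scala
import scala.collection.JavaConverters._

object JavaIteratorDemo {
  // A plain Java collection of boxed integers, standing in for a Java data source.
  val javaList = new java.util.ArrayList[Integer]()
  (1 to 7).foreach(i => javaList.add(Int.box(i)))

  val javaIterator: java.util.Iterator[Integer] = javaList.iterator()

  // asScala wraps the Java iterator without copying its elements.
  val scalaIterator: Iterator[Integer] = javaIterator.asScala

  // Batches of 3: two full batches and one final batch with a single element.
  val batchSizes: List[Int] = scalaIterator.grouped(3).map(_.size).toList
}
```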
Upvotes: 4