Randomize

Reputation: 9103

Spark: grouping during loading

Usually I load CSV files and then run different kinds of aggregations, like "group by", with Spark. I was wondering whether it is possible to start this sort of operation during the file loading (typically a few million rows) instead of sequentializing the steps, and whether it can be worthwhile as a time saving.

Example:

val csv = sc.textFile("file.csv")
// split each line on commas and trim the fields
val data = csv.map(line => line.split(",").map(elem => elem.trim))
val header = data.first()
// drop the header row: keep lines whose first field differs from the header's
val rows = data.filter(line => line(0) != header(0))
// key each row by its first column, then group
val trows = rows.map(row => (row(0), row))
trows.groupByKey()

From my understanding of how Spark works, the groupBy (or aggregate) will be "postponed" until the whole CSV file has been loaded into memory. If this is correct, can the loading and the grouping run at the "same" time instead of sequencing the two steps?

Upvotes: 3

Views: 146

Answers (2)

Franzi

Reputation: 1871

To start a group-by-on-load operation, you could proceed with two options:

  1. Write your own loader and do your own group by inside it with aggregateByKey (see the sketch after the explanation below). The cons of that are writing more code & more maintenance.
  2. Use Parquet files as input + DataFrames; since Parquet is columnar, Spark will read only the columns used in your groupBy, so it should be faster. - DataFrameReader

    df = spark.read.parquet('file_path')
    df = df.groupBy('column_a', 'column_b', '...').count()
    df.show()
    

Because Spark is lazy, it won't load your file until you call an action method like show/collect/write. So Spark will know which columns to read and which to ignore during the load.
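
For option 1, a minimal sketch in Scala (assuming an existing SparkContext `sc` and a hypothetical per-key count; the file name and key column mirror the question's example). aggregateByKey does map-side combining, so each partition builds partial results as its lines are read, rather than grouping after a full load:

    // Sketch only: count rows per key while the file is being consumed.
    val counts = sc.textFile("file.csv")
      .map(_.split(",").map(_.trim))
      .filter(row => row(0) != "id")        // drop the header row
      .map(row => (row(0), 1L))             // key by the first column
      .aggregateByKey(0L)(_ + _, _ + _)     // seqOp within a partition, combOp across
    counts.take(10).foreach(println)        // the action triggers the whole pipeline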

Upvotes: 1

zero323

Reputation: 330413

the groupBy (or aggregate) will be "postponed" to the loading in memory of the whole file csv.

That is not the case. At the local (single-partition) level Spark operates on lazy sequences, so operations belonging to a single task (this includes map-side aggregation) can be squashed together.

In other words, when you have a chain of methods, operations are performed element-by-element, not transformation-by-transformation: the first line will be mapped, filtered, mapped once again, and passed to the aggregator before the next one is accessed.
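
A minimal plain-Scala sketch of this behaviour (no Spark needed; within a partition Spark composes transformations over a lazy iterator in much the same way):

    // Each element flows through the whole chain before the next is read,
    // so the printed output interleaves per element rather than per stage.
    val it = Iterator("1,a", "2,b", "3,c")
      .map    { line => println(s"map:    $line"); line.split(",") }
      .filter { row  => println(s"filter: ${row(0)}"); row(0) != "id" }
    it.foreach(row => println(s"use:    ${row.mkString("|")}"))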

Upvotes: 3
