Xiang Liu

Reputation: 43

How to efficiently group every k rows in a Spark Dataset?

I created a Spark Dataset[Row], where each Row is Row(x: Vector). x here is a 1 x p vector.

Is it possible to 1) group every k rows and 2) concatenate these rows into a k x p matrix mX, i.e., change Dataset[Row(Vector)] to Dataset[Row(Matrix)]?

Here is my current solution: convert this Dataset[Row] to an RDD, and concatenate every k rows with zipWithIndex and aggregateByKey.

val dataRDD = data_df.rdd.zipWithIndex
    .map { case (line, index) => (index / k, line) }
    .aggregateByKey(...)(..., ...)
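
For illustration, the aggregation could be filled in along these lines (a sketch only; groupByKey stands in for aggregateByKey for brevity, the vector is assumed to be column 0, and ml.linalg.DenseMatrix is used for the matrix type):

import org.apache.spark.ml.linalg.{DenseMatrix, Vector}

// Illustrative sketch: k rows per group, vector assumed to be column 0
val k = 4
val matrices = data_df.rdd.zipWithIndex
  .map { case (row, index) => (index / k, row.getAs[Vector](0)) }
  .groupByKey()
  .mapValues { vecs =>
    val rows = vecs.toArray              // up to k vectors, each of length p
    val p = rows.head.size
    // DenseMatrix stores values column-major, so lay them out accordingly
    val values = Array.ofDim[Double](rows.length * p)
    for (i <- rows.indices; j <- 0 until p)
      values(j * rows.length + i) = rows(i)(j)
    new DenseMatrix(rows.length, p, values)
  }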

But it doesn't seem very efficient. Is there a more efficient way to do this?

Thanks in advance.

Upvotes: 1

Views: 2029

Answers (2)

Ged

Reputation: 18043

Here is a solution that groups N records into columns:

Generate a DataFrame from the RDD and process it as shown below.

Here g is the group, k is the record number within the group (it restarts for each g), and v is the record content.

The input is a file of 6 lines, and I used groups of 3 here.

The only drawback is when the number of lines leaves a remainder of fewer than N lines for the last group.

import org.apache.spark.sql.functions._
import org.apache.spark.mllib.rdd.RDDFunctions._
import spark.implicits._   // for toDF and the $ column syntax (pre-imported in spark-shell)

val dfsFilename = "/FileStore/tables/7dxa9btd1477497663691/Text_File_01-880f5.txt"
val readFileRDD = spark.sparkContext.textFile(dfsFilename)
// Non-overlapping windows of 3 lines, each window indexed -> the group g
val rdd2 = readFileRDD.sliding(3, 3).zipWithIndex
// Index the lines inside each window -> the per-group key k
val rdd3 = rdd2.map(r => (r._1.zipWithIndex, r._2))
val df = rdd3.toDF("vk", "g")

// One row per (value, key) pair, then split the struct into v and k columns
val df2 = df.withColumn("vke", explode($"vk")).drop("vk")
val df3 = df2.withColumn("k", $"vke._2").withColumn("v", $"vke._1").drop("vke")

// Pivot the per-group keys into columns, one row per group
val result = df3
  .groupBy("g")
  .pivot("k")
  .agg(expr("first(v)"))

result.show()

returns:

+---+--------------------+--------------------+--------------------+
|  g|                   0|                   1|                   2|
+---+--------------------+--------------------+--------------------+
|  0|The quick brown f...|Here he lays I te...|Gone are the days...|
|  1|  Gosh, what to say.|Hallo, hallo, how...|          I am fine.|
+---+--------------------+--------------------+--------------------+

Upvotes: 0

Sim

Reputation: 13538

There are two performance issues with your approach:

  1. Using a global ordering
  2. Doing a shuffle to build the groups of k

If you absolutely need a global ordering, starting from line 1, and you cannot break up your data into multiple partitions, then Spark has to move all the data through a single core. You can speed that part up by finding a way to have more than one partition.

You can avoid a shuffle by processing the data one partition at a time using mapPartitions:

spark.range(1, 20).coalesce(1).mapPartitions(_.grouped(5)).show

+--------------------+
|               value|
+--------------------+
|     [1, 2, 3, 4, 5]|
|    [6, 7, 8, 9, 10]|
|[11, 12, 13, 14, 15]|
|    [16, 17, 18, 19]|
+--------------------+

Note that coalesce(1) above is forcing all 19 rows produced by range(1, 20) into a single partition.
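
As a sketch (not from the answer itself), the same mapPartitions idea could be applied to the Vector-to-Matrix problem from the question; the column position, the value of k, and the use of ml.linalg.DenseMatrix are assumptions for illustration:

import org.apache.spark.ml.linalg.{DenseMatrix, Vector}

val k = 4  // group size, assumed
val matrices = data_df.rdd
  .map(_.getAs[Vector](0))                   // pull out the vector column
  .mapPartitions(_.grouped(k).map { vecs =>
    val p = vecs.head.size
    // DenseMatrix is column-major, so fill the values array accordingly
    val values = Array.ofDim[Double](vecs.length * p)
    for (i <- vecs.indices; j <- 0 until p)
      values(j * vecs.length + i) = vecs(i)(j)
    new DenseMatrix(vecs.length, p, values)   // up-to-k x p matrix, no shuffle
  })

Each partition is chunked independently, so the last group in each partition may hold fewer than k rows, and the grouping does not follow a single global order across partitions.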

Upvotes: 1
