Reputation: 786
If I have a variable such as books: RDD[(String, Integer, Integer)]
, how do I merge entries with the same String key (which could represent a title) and sum the two corresponding integers (which could represent pages and price)?
ex:
[("book1", 20, 10),
("book2", 5, 10),
("book1", 100, 100)]
becomes
[("book1", 120, 110),
("book2", 5, 10)]
Upvotes: 2
Views: 2621
Reputation: 3260
With an RDD you can use reduceByKey.
case class Book(name: String, i: Int, j: Int) {
  def +(b: Book): Book =
    if (name == b.name) Book(name, i + b.i, j + b.j)
    else throw new IllegalArgumentException(s"Cannot add books with different names: $name, ${b.name}")
}
val rdd = sc.parallelize(Seq(
  Book("book1", 20, 10),
  Book("book2", 5, 10),
  Book("book1", 100, 100)))
val aggRdd = rdd.map(book => (book.name, book))
.reduceByKey(_+_) // reduce calling our defined `+` function
.map(_._2) // we don't need the tuple anymore, just get the Books
aggRdd.foreach(println)
// Book(book1,120,110)
// Book(book2,5,10)
Upvotes: 4
Reputation: 7926
Try converting it first to a key-tuple RDD and then performing a reduceByKey:
yourRDD.map(t => (t._1, (t._2, t._3)))
.reduceByKey((acc, elem) => (acc._1 + elem._1, acc._2 + elem._2))
Output:
(book2,(5,10))
(book1,(120,110))
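If the original flat-triple shape is needed afterwards, one more map flattens the nested tuple back out. A minimal sketch of that step with plain tuples (the same `case` pattern works in an RDD `.map`):

```scala
object FlattenResult extends App {
  // Shape produced by the reduceByKey above: (String, (Int, Int))
  val reduced = List(("book2", (5, 10)), ("book1", (120, 110)))

  // Restore the original (String, Int, Int) triples.
  val flattened = reduced.map { case (title, (pages, price)) => (title, pages, price) }

  println(flattened)
}
```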
Upvotes: 2
Reputation: 35249
Just use the Dataset API:
val spark: SparkSession = SparkSession.builder.getOrCreate()
val rdd = spark.sparkContext.parallelize(Seq(
("book1", 20, 10), ("book2", 5, 10), ("book1", 100, 100)
))
spark.createDataFrame(rdd).groupBy("_1").sum().show()
// +-----+-------+-------+
// | _1|sum(_2)|sum(_3)|
// +-----+-------+-------+
// |book1| 120| 110|
// |book2| 5| 10|
// +-----+-------+-------+
Upvotes: 2