Reputation: 786
If I have a variable such as books: RDD[(String, Integer, Integer)]
, how do I merge entries with the same String key (which could represent a title) and sum the two corresponding integers (which could represent pages and price)?
ex:
[("book1", 20, 10),
("book2", 5, 10),
("book1", 100, 100)]
becomes
[("book1", 120, 110),
("book2", 5, 10)]
Upvotes: 2
Views: 2621
Reputation: 3260
With an RDD you can use reduceByKey.
case class Book(name: String, i: Int, j: Int) {
  def +(b: Book): Book =
    if (name == b.name) Book(name, i + b.i, j + b.j)
    else throw new IllegalArgumentException(s"Cannot add books with different names: $name, ${b.name}")
}
val rdd = sc.parallelize(Seq(
  Book("book1", 20, 10),
  Book("book2", 5, 10),
  Book("book1", 100, 100)))
val aggRdd = rdd.map(book => (book.name, book))
.reduceByKey(_+_) // reduce calling our defined `+` function
.map(_._2) // we don't need the tuple anymore, just get the Books
aggRdd.foreach(println)
// Book(book1,120,110)
// Book(book2,5,10)
Upvotes: 4
Reputation: 7926
Try converting it first to a key-tuple RDD and then performing a reduceByKey:
yourRDD.map(t => (t._1, (t._2, t._3)))
.reduceByKey((acc, elem) => (acc._1 + elem._1, acc._2 + elem._2))
Output:
(book2,(5,10))
(book1,(120,110))
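If the original flat-triple shape is needed afterwards, one more map flattens the nested tuple back out. A minimal sketch of that step with plain tuples (the same `case` pattern works in an RDD `.map`):

```scala
object FlattenResult extends App {
  // Shape produced by the reduceByKey above: (String, (Int, Int))
  val reduced = List(("book2", (5, 10)), ("book1", (120, 110)))

  // Restore the original (String, Int, Int) triples.
  val flattened = reduced.map { case (title, (pages, price)) => (title, pages, price) }

  println(flattened)
}
```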
Upvotes: 2
Reputation: 35249
Just use the Dataset API:
val spark: SparkSession = SparkSession.builder.getOrCreate()
val rdd = spark.sparkContext.parallelize(Seq(
("book1", 20, 10), ("book2", 5, 10), ("book1", 100, 100)
))
spark.createDataFrame(rdd).groupBy("_1").sum().show()
// +-----+-------+-------+
// | _1|sum(_2)|sum(_3)|
// +-----+-------+-------+
// |book1| 120| 110|
// |book2| 5| 10|
// +-----+-------+-------+
Upvotes: 2