Reputation: 1
I am new to Scala and Spark.
I have two RDDs like:
RDD_A = (keyA,5),(KeyB,10)
RDD_B = (keyA,3),(KeyB,7)
How do I calculate RDD_A - RDD_B so that I get (keyA,2),(KeyB,3)?
I tried subtract and subtractByKey, but neither gives me output like the above.
Upvotes: 0
Views: 195
Reputation: 4045
An RDD solution for the question. See the inline code comments for the explanation.
import org.apache.spark.sql.SparkSession

object SubtractRDD {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate() // Create Spark session
    val list1 = List(("keyA", 5), ("keyB", 10))
    val list2 = List(("keyA", 3), ("keyB", 7))
    val rdd1 = spark.sparkContext.parallelize(list1) // Convert list to RDD
    val rdd2 = spark.sparkContext.parallelize(list2)
    val result = rdd1.join(rdd2) // Inner join on key: (key, (value1, value2))
      .map(x => (x._1, x._2._1 - x._2._2)) // Subtract the second value from the first
      .collectAsMap() // Collect result as a Map on the driver
    println(result)
  }
}
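As for why subtract and subtractByKey did not help: both are set-style operations that drop elements instead of doing arithmetic on the values. A minimal sketch of the difference, assuming the same local Spark setup as above (the object name SubtractVsJoin and the helper compare are illustrative, not part of Spark):

```scala
import org.apache.spark.sql.SparkSession

object SubtractVsJoin {
  type Pairs = Set[(String, Int)]

  // Returns the results of subtract, subtractByKey and join + mapValues, in that order.
  def compare(spark: SparkSession): (Pairs, Pairs, Pairs) = {
    val sc = spark.sparkContext
    val rddA = sc.parallelize(Seq(("keyA", 5), ("keyB", 10)))
    val rddB = sc.parallelize(Seq(("keyA", 3), ("keyB", 7)))

    // subtract removes elements of rddA that also appear, as whole pairs, in rddB.
    // No (key, value) pair is shared here, so rddA comes back unchanged.
    val bySubtract = rddA.subtract(rddB).collect().toSet

    // subtractByKey removes every pair of rddA whose key exists in rddB.
    // Both keys exist in rddB, so the result is empty.
    val bySubtractByKey = rddA.subtractByKey(rddB).collect().toSet

    // join pairs up the values per key; mapValues then does the arithmetic.
    val byJoin = rddA.join(rddB).mapValues { case (a, b) => a - b }.collect().toSet

    (bySubtract, bySubtractByKey, byJoin)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("SubtractVsJoin").getOrCreate()
    val (bySubtract, bySubtractByKey, byJoin) = compare(spark)
    println(bySubtract)
    println(bySubtractByKey)
    println(byJoin)
    spark.stop()
  }
}
```

Note that the inner join only keeps keys present in both RDDs; keys that exist on one side only are dropped from the result.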
Upvotes: 0
Reputation: 1572
Let's assume that each dataset has only one value per key. Here they are represented as DataFrames:
import spark.implicits._ // needed for toDF; assumes a SparkSession named spark

val df =
  Seq(
    ("A", 5),
    ("B", 10)
  ).toDF("key", "value")
val df2 =
  Seq(
    ("A", 3),
    ("B", 7)
  ).toDF("key", "value")
You can merge these DataFrames using union and perform the computation via groupBy as follows:
import org.apache.spark.sql.functions._
// Note: first/last depend on row order, which Spark does not guarantee after a shuffle
df.union(df2)
  .groupBy("key")
  .agg(first("value").minus(last("value")).as("value"))
  .show()
This will print:
+---+-----+
|key|value|
+---+-----+
|  B|    3|
|  A|    2|
+---+-----+
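Since first and last depend on row order, which Spark does not guarantee after a shuffle, joining on the key may be a safer variant of the same idea. A minimal sketch, assuming the same two DataFrames and a local SparkSession (the object name JoinDiff and the helper diff are illustrative, not part of Spark):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object JoinDiff {
  // Subtract right.value from left.value for every key present in both DataFrames.
  def diff(left: DataFrame, right: DataFrame): DataFrame =
    left.as("l")
      .join(right.as("r"), col("l.key") === col("r.key")) // inner join on key
      .select(
        col("l.key").as("key"),
        (col("l.value") - col("r.value")).as("value")
      )

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("JoinDiff").getOrCreate()
    import spark.implicits._ // needed for toDF

    val df = Seq(("A", 5), ("B", 10)).toDF("key", "value")
    val df2 = Seq(("A", 3), ("B", 7)).toDF("key", "value")

    diff(df, df2).show()
    spark.stop()
  }
}
```

Here the subtraction is tied to which DataFrame each value came from, not to the order in which rows happen to arrive in each group.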
Upvotes: 1