Spark scala join RDD between 2 datasets

Question

Supposed i have two dataset as following:

Dataset 1:

id, name, score
1, Bill, 200
2, Bew, 23
3, Amy, 44
4, Ramond, 68

Dataset 2:

id,message
1, i love Bill
2, i hate Bill
3, Bew go go !
4, Amy is the best
5, Ramond is the wrost
6, Bill go go
7, Bill i love ya
8, Ramond is Bad
9, Amy is great

I wanted to join above two datasets and counting the top number of person's name that appears in dataset2 according to the name in dataset1 the result should be:

Bill, 4
Ramond, 2 
..
..

I managed to join both of them together but not sure how to count how many time it appear for each person.

Any suggestion would be appreciated.

Edited: my join code:

val rdd = sc.textFile("dataset1")
val rdd2 = sc.textFile("dataset2")
val rddPair1 = rdd.map { x =>
  var data = x.split(",")
  new Tuple2(data(0), data(1))
}
val rddPair2 = rdd2.map { x =>
  var data = x.split(",")
  new Tuple2(data(0), data(1))
}
rddPair1.join(rddPair2).collect().foreach(f =>{
  println(f._1+" "+f._2._1+" "+f._2._2)
})

Ramesh Maharjan · Accepted Answer

Using RDDs, achieving the solution you desire, would be complex. Not so much using dataframes.

First step would be to read the two files you have into dataframes as below

val df1 = sqlContext.read.format("com.databricks.spark.csv")
    .option("header", true)
  .load("dataset1")
val df2 = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", true)
  .load("dataset1")

so that you should be having

df1
+---+------+-----+
|id |name  |score|
+---+------+-----+
|1  |Bill  |200  |
|2  |Bew   |23   |
|3  |Amy   |44   |
|4  |Ramond|68   |
+---+------+-----+

df2
+---+-------------------+
|id |message            |
+---+-------------------+
|1  |i love Bill        |
|2  |i hate Bill        |
|3  |Bew go go !        |
|4  |Amy is the best    |
|5  |Ramond is the wrost|
|6  |Bill go go         |
|7  |Bill i love ya     |
|8  |Ramond is Bad      |
|9  |Amy is great       |
+---+-------------------+

join, groupBy and count should give your desired output as

df1.join(df2, df2("message").contains(df1("name")), "left").groupBy("name").count().as("count").show(false)

Final output would be

+------+-----+
|name  |count|
+------+-----+
|Ramond|2    |
|Bill  |4    |
|Amy   |2    |
|Bew   |1    |
+------+-----+

Spark scala join RDD between 2 datasets

Answers (1)

Related Questions