Reputation: 107
Supposed i have two dataset as following:
Dataset 1:
id, name, score
1, Bill, 200
2, Bew, 23
3, Amy, 44
4, Ramond, 68
Dataset 2:
id,message
1, i love Bill
2, i hate Bill
3, Bew go go !
4, Amy is the best
5, Ramond is the wrost
6, Bill go go
7, Bill i love ya
8, Ramond is Bad
9, Amy is great
I wanted to join above two datasets and counting the top number of person's name that appears in dataset2 according to the name in dataset1 the result should be:
Bill, 4
Ramond, 2
..
..
I managed to join both of them together but not sure how to count how many time it appear for each person.
Any suggestion would be appreciated.
Edited: my join code:
val rdd = sc.textFile("dataset1")
val rdd2 = sc.textFile("dataset2")
val rddPair1 = rdd.map { x =>
var data = x.split(",")
new Tuple2(data(0), data(1))
}
val rddPair2 = rdd2.map { x =>
var data = x.split(",")
new Tuple2(data(0), data(1))
}
rddPair1.join(rddPair2).collect().foreach(f =>{
println(f._1+" "+f._2._1+" "+f._2._2)
})
Upvotes: 1
Views: 1377
Reputation: 41957
Using RDDs
, achieving the solution you desire, would be complex. Not so much using dataframes
.
First step would be to read the two files you have into dataframes
as below
val df1 = sqlContext.read.format("com.databricks.spark.csv")
.option("header", true)
.load("dataset1")
val df2 = sqlContext.read.format("com.databricks.spark.csv")
.option("header", true)
.load("dataset1")
so that you should be having
df1
+---+------+-----+
|id |name |score|
+---+------+-----+
|1 |Bill |200 |
|2 |Bew |23 |
|3 |Amy |44 |
|4 |Ramond|68 |
+---+------+-----+
df2
+---+-------------------+
|id |message |
+---+-------------------+
|1 |i love Bill |
|2 |i hate Bill |
|3 |Bew go go ! |
|4 |Amy is the best |
|5 |Ramond is the wrost|
|6 |Bill go go |
|7 |Bill i love ya |
|8 |Ramond is Bad |
|9 |Amy is great |
+---+-------------------+
join
, groupBy
and count
should give your desired output as
df1.join(df2, df2("message").contains(df1("name")), "left").groupBy("name").count().as("count").show(false)
Final output would be
+------+-----+
|name |count|
+------+-----+
|Ramond|2 |
|Bill |4 |
|Amy |2 |
|Bew |1 |
+------+-----+
Upvotes: 2