Reputation: 59
I have a pair RDD like this:
id value
id1 set(1232, 3,1,93,35)
id2 set(321,42,5,13)
id3 set(1233,3,5)
id4 set(1232, 56,3,35,5)
Now, I want to get the total count of ids per value contained in the set. So the output for the above table should be something like this:
set value count
1232 2
1 1
93 1
35 2
3 3
5 3
321 1
42 1
13 1
1233 1
56 1
Is there a way to achieve this?
Upvotes: 1
Views: 162
Reputation: 2091
yourrdd.toDF().withColumn(“_2”,explode(col(“_2”))).groupBy(“_2”).count.show
Upvotes: 0
Reputation: 28322
I would recommend using the dataframe API since it is easier and more understandable. Using this API, the problem can be solved by using explode
and groupBy
as follows:
df.withColumn("value", explode($"value"))
.groupBy("value")
.count()
Using an RDD instead, one possible solution is using flatMap
and aggregateByKey
:
rdd.flatMap(x => x._2.map(s => (s, x._1)))
.aggregateByKey(0)((n, str) => n + 1, (p1, p2) => p1 + p2)
The result is the same in both cases.
Upvotes: 2