noobtoPro
noobtoPro

Reputation: 59

Count the number of keys for each value of a set tagged to that key

I have a pair RDD like this:

id   value
id1  set(1232, 3,1,93,35)
id2  set(321,42,5,13)
id3  set(1233,3,5)
id4  set(1232, 56,3,35,5)

Now, I want to get the total count of ids per value contained in the set. So the output for the above table should be something like this:

set value  count
    1232   2
    1      1
    93     1
    35     2
    3      3
    5      3
    321    1
    42     1
    13     1
    1233   1
    56     1

Is there a way to achieve this?

Upvotes: 1

Views: 162

Answers (2)

Chandan Ray
Chandan Ray

Reputation: 2091

yourrdd.toDF().withColumn(“_2”,explode(col(“_2”))).groupBy(“_2”).count.show

Upvotes: 0

Shaido
Shaido

Reputation: 28322

I would recommend using the dataframe API since it is easier and more understandable. Using this API, the problem can be solved by using explode and groupBy as follows:

df.withColumn("value", explode($"value"))
  .groupBy("value")
  .count()

Using an RDD instead, one possible solution is using flatMap and aggregateByKey:

rdd.flatMap(x => x._2.map(s => (s, x._1)))
  .aggregateByKey(0)((n, str) => n + 1, (p1, p2) => p1 + p2)

The result is the same in both cases.

Upvotes: 2

Related Questions