Yugaraj P

Reputation: 23

Spark - Find total occurrence of each distinct value in two different columns

I want to find the total number of occurrences of each distinct value across both col1 and col2.

col1   col2
a      a
a      c

Result

Id   count
a    3
c    1

If this were SQL, I would do something like this:

select id, sum(count)
from (
  select col1 as id, count(*) as count from table group by col1
  union all
  select col2 as id, count(*) as count from table group by col2
)
group by id

I am new to Spark and need help finding an efficient way to do this.
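For reference, that same SQL can be run in Spark directly by registering the DataFrame as a temporary view and calling spark.sql (a minimal sketch; the DataFrame df and the view name are illustrative):

// assuming an existing SparkSession `spark` and a DataFrame `df` with col1 and col2
df.createOrReplaceTempView("t")
spark.sql("""
  select id, sum(cnt) as count
  from (
    select col1 as id, count(*) as cnt from t group by col1
    union all
    select col2 as id, count(*) as cnt from t group by col2
  )
  group by id
""").show()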

Upvotes: 2

Views: 338

Answers (2)

Vincent Doba

Reputation: 5068

You can use the explode function on an array that contains the two columns, as follows:

import org.apache.spark.sql.functions.{array, col, explode}

input
  .select("col1", "col2")
  .withColumn("Id", explode(array(col("col1"), col("col2")))) // one output row per value from either column
  .groupBy("Id")
  .count()
  .show()

That will print:

+---+-----+
|Id |count|
+---+-----+
|c  |1    |
|a  |3    |
+---+-----+

This solution avoids having Spark compute the input DataFrame twice.
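Note that if col1 or col2 can contain nulls, the exploded array will emit null rows, which then get counted as well. A minimal sketch of excluding them, assuming the same input DataFrame:

import org.apache.spark.sql.functions.{array, col, explode}

input
  .withColumn("Id", explode(array(col("col1"), col("col2"))))
  .filter(col("Id").isNotNull) // drop null values before counting
  .groupBy("Id")
  .count()
  .show()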

Upvotes: 0

ernest_k

Reputation: 45319

You can simply union col1 and col2 into a single id column, then group by it:

import org.apache.spark.sql.functions.col

df.select(col("col1").alias("id"))
  .unionAll(df.select(col("col2").alias("id"))) // stack both columns into one (keeps duplicates)
  .groupBy("id")
  .count()
  .show()

Which prints:

+---+-----+
| id|count|
+---+-----+
|  c|    1|
|  a|    3|
+---+-----+
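
As a side note, the same pattern extends to any number of columns by unioning one single-column select per column name (a sketch; the column list is illustrative):

import org.apache.spark.sql.functions.col

val cols = Seq("col1", "col2") // add further column names here as needed
cols.map(c => df.select(col(c).alias("id")))
  .reduce(_.unionAll(_)) // stack all the columns into a single id column
  .groupBy("id")
  .count()
  .show()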

Upvotes: 2
