Yugaraj P

Reputation: 23

Spark - Find total occurrence of each distinct value in two different columns

I want to find the total number of occurrences of each distinct value across both col1 and col2.

col1   col2
a      a
a      c

Result

Id   count
a    3
c    1

If this were SQL, I would do something like this:

select id, sum(count)
from (
  select col1 as id, count(*) as count from table group by col1
  union all
  select col2 as id, count(*) as count from table group by col2
)
group by id

I am new to Spark and need help finding an efficient way to do this.
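For reference, that same SQL can be run in Spark directly by registering the DataFrame as a temporary view and calling spark.sql (a minimal sketch; the DataFrame df and the view name are illustrative):

// assuming an existing SparkSession `spark` and a DataFrame `df` with col1 and col2
df.createOrReplaceTempView("t")
spark.sql("""
  select id, sum(cnt) as count
  from (
    select col1 as id, count(*) as cnt from t group by col1
    union all
    select col2 as id, count(*) as cnt from t group by col2
  )
  group by id
""").show()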

Upvotes: 2

Views: 338

Answers (2)

Vincent Doba

Reputation: 5068

You can use the explode function on an array that contains the two columns, as follows:

import org.apache.spark.sql.functions.{array, col, explode}

input
  .select("col1", "col2")
  .withColumn("Id", explode(array(col("col1"), col("col2")))) // one output row per value from either column
  .groupBy("Id")
  .count()
  .show()

That will print:

+---+-----+
|Id |count|
+---+-----+
|c  |1    |
|a  |3    |
+---+-----+

This solution avoids having Spark compute the input DataFrame twice.
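Note that if col1 or col2 can contain nulls, the exploded array will emit null rows, which then get counted as well. A minimal sketch of excluding them, assuming the same input DataFrame:

import org.apache.spark.sql.functions.{array, col, explode}

input
  .withColumn("Id", explode(array(col("col1"), col("col2"))))
  .filter(col("Id").isNotNull) // drop null values before counting
  .groupBy("Id")
  .count()
  .show()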

Upvotes: 0

ernest_k

Reputation: 45319

You can simply union col1 and col2 into a single id column, then group by it:

import org.apache.spark.sql.functions.col

df.select(col("col1").alias("id"))
  .unionAll(df.select(col("col2").alias("id"))) // stack both columns into one (keeps duplicates)
  .groupBy("id")
  .count()
  .show()

Which prints:

+---+-----+
| id|count|
+---+-----+
|  c|    1|
|  a|    3|
+---+-----+
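
As a side note, the same pattern extends to any number of columns by unioning one single-column select per column name (a sketch; the column list is illustrative):

import org.apache.spark.sql.functions.col

val cols = Seq("col1", "col2") // add further column names here as needed
cols.map(c => df.select(col(c).alias("id")))
  .reduce(_.unionAll(_)) // stack all the columns into a single id column
  .groupBy("id")
  .count()
  .show()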

Upvotes: 2
