Reputation: 23
I want to find the total number of occurrences of each distinct value across both col1 and col2.
col1  col2
a     a
a     c

Result:

Id  count
a   3
c   1
If this were SQL, I would do something like this:
select id, sum(count) as count
from (
    select col1 as id, count(*) as count from table group by col1
    union all
    select col2 as id, count(*) as count from table group by col2
) t
group by id
I am new to Spark and need help finding an efficient way to do this.
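For reference, essentially the same query can be run through Spark SQL once the DataFrame is registered as a temporary view. A minimal sketch in Scala, assuming a SparkSession named spark and a DataFrame df holding the two columns (both names are assumptions); the subquery flattens the two columns before counting, which is equivalent to summing the per-column counts:

// Minimal sketch: run the question's SQL through Spark SQL.
// Assumes a SparkSession `spark` and a DataFrame `df` with col1/col2.
df.createOrReplaceTempView("t")  // "t" is an arbitrary view name
spark.sql("""
  SELECT id, COUNT(*) AS count
  FROM (
    SELECT col1 AS id FROM t
    UNION ALL
    SELECT col2 AS id FROM t
  ) u
  GROUP BY id
""").show()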
Upvotes: 2
Views: 338
Reputation: 5068
You can use the explode function on an array containing the two columns, as follows:
import org.apache.spark.sql.functions.{array, col, explode}

input
  .select("col1", "col2")
  // Pack both columns into an array, then explode it into one value per row
  .withColumn("Id", explode(array(col("col1"), col("col2"))))
  .groupBy("Id")
  .count()
  .show()
That will print:
+---+-----+
|Id |count|
+---+-----+
|c |1 |
|a |3 |
+---+-----+
This solution avoids having Spark compute the input DataFrame twice.
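For a reproducible test, the sample data from the question can be built as follows; a minimal sketch, assuming a SparkSession named spark (the variable name is an assumption):

import spark.implicits._  // enables Seq(...).toDF
// Sample rows from the question: col1 = [a, a], col2 = [a, c]
val input = Seq(("a", "a"), ("a", "c")).toDF("col1", "col2")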
Upvotes: 0
Reputation: 45319
You can simply union col1 and col2 as a new id column, then group by it:
import org.apache.spark.sql.functions.col

df.select(col("col1").alias("id"))
  // unionAll is deprecated since Spark 2.0; union is the equivalent
  .unionAll(df.select(col("col2").alias("id")))
  .groupBy("id")
  .count()
  .show()
Which prints:
+---+-----+
| id|count|
+---+-----+
| c| 1|
| a| 3|
+---+-----+
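Note that, per the point in the other answer about double computation, this approach reads df twice (once per select), so caching df beforehand can help if it is expensive to compute. A flatMap over the Dataset is another single-pass way to flatten the two columns; a minimal sketch, assuming both columns are strings in positions 0 and 1 (both assumptions):

import spark.implicits._  // provides the String encoder and toDF
df.flatMap(r => Seq(r.getString(0), r.getString(1)))  // emit both column values per row
  .toDF("id")
  .groupBy("id")
  .count()
  .show()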
Upvotes: 2