Spark dataframe transformation to get counts of a particular value in a column

Question

I have the following type of dataset

A	B
on	off
off	on
on	nc
nc	off

I have shown only two columns but there are many A , B, C and so on..
Possible values of all these columns are on, off and nc.
I want to understand how to get the count of on for each of the column in a dataframe.
I want the output in this type of dataframe.

Features	Count_on
A	3
B	4
C	6

and so on...
I tried to do the count of each of the column with on using following code.
val A_count_on = df_original.groupBy("A").count().filter("A=='on'").collect()
val A_value_on = A_count_on(0).getLong(1)
This is how I get value of on only in A column. Running the same code on multiple columns I will get the count of the rest of the columns where that column value is on.
However, I dont understand how I generate the result dataframe, where I get the column name of the original dataframe as the row data with the count in the second column.
Please help if there is any easy way to do this.

Kafels · Accepted Answer

You can first use the stack function for all the desired columns and apply a pivot to your table:

val df = sc.parallelize(Seq(
  ("on", "off"),
  ("off", "on"),
  ("on", "nc"),
  ("nc", "off")
)).toDF("A", "B")

val dfStack = df.selectExpr("""stack(2, "A", A, "B", B) as (features, value)""")
// +--------+-----+
// |features|value|
// +--------+-----+
// |A       |on   |
// |B       |off  |
// |A       |off  |
// |B       |on   |
// |A       |on   |
// |B       |nc   |
// |A       |nc   |
// |B       |off  |
// +--------+-----+

val dfGroup = dfStack.groupBy("features").pivot("value", Seq("on", "off", "nc")).count()
dfGroup.sort("features").show(false)

Output

+--------+---+---+---+
|features|on |off|nc |
+--------+---+---+---+
|A       |2  |1  |1  |
|B       |1  |2  |1  |
+--------+---+---+---+

Spark dataframe transformation to get counts of a particular value in a column

Answers (1)

Related Questions