Sameer Joshi

Reputation: 391

Spark dataframe transformation to get counts of a particular value in a column

I have the following type of dataset:

A B
on off
off on
on nc
nc off

I have shown only two columns, but there are many: A, B, C and so on.
The possible values of all these columns are on, off and nc.
I want to understand how to get the count of on for each column.
I want the output in a dataframe of this form:

Features Count_on
A 3
B 4
C 6

and so on...
I tried to get the count of on for each column using the following code:

val A_count_on = df_original.groupBy("A").count().filter("A=='on'").collect() // rows of (value, count)
val A_value_on = A_count_on(0).getLong(1) // the count for "on" in column A

This gives me the count of on in column A, and running the same code on the other columns gives me their counts too. However, I don't understand how to generate the result dataframe, where the column names of the original dataframe become the row values, with the count in the second column.
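The only approach I can think of is to loop over the column names and build the result by hand. An untested sketch of that idea (it fires a separate Spark job for every column):

import org.apache.spark.sql.functions.col
import spark.implicits._

// build one (columnName, count) pair per column, then turn the pairs into a dataframe
val counts = df_original.columns.toSeq.map(c => (c, df_original.filter(col(c) === "on").count()))
val result = counts.toDF("Features", "Count_on")

This feels clunky, though.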
Please help if there is any easy way to do this.

Upvotes: 1

Views: 1203

Answers (1)

Kafels

Reputation: 4059

You can first use the stack function to unpivot all the desired columns into (feature, value) rows, then apply a pivot to that table:

val df = sc.parallelize(Seq(
  ("on", "off"),
  ("off", "on"),
  ("on", "nc"),
  ("nc", "off")
)).toDF("A", "B")

val dfStack = df.selectExpr("""stack(2, "A", A, "B", B) as (features, value)""")
// +--------+-----+
// |features|value|
// +--------+-----+
// |A       |on   |
// |B       |off  |
// |A       |off  |
// |B       |on   |
// |A       |on   |
// |B       |nc   |
// |A       |nc   |
// |B       |off  |
// +--------+-----+
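If you have many columns, you don't have to write the stack arguments by hand; you can build the expression from the column names. A sketch, assuming every column of df should be unpivoted:

val cols = df.columns
// produces e.g. stack(2, 'A', `A`, 'B', `B`) — one (label, column) pair per column
val stackExpr = cols.map(c => s"'$c', `$c`").mkString(", ")
val dfStackAll = df.selectExpr(s"stack(${cols.length}, $stackExpr) as (features, value)")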

val dfGroup = dfStack.groupBy("features").pivot("value", Seq("on", "off", "nc")).count()
dfGroup.sort("features").show(false)

Output

+--------+---+---+---+
|features|on |off|nc |
+--------+---+---+---+
|A       |2  |1  |1  |
|B       |1  |2  |1  |
+--------+---+---+---+
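If you only need the on counts in the exact shape you asked for, you can select and rename from the pivoted result (the backticks guard against on clashing with the SQL keyword):

val dfOnCounts = dfGroup.selectExpr("features as Features", "`on` as Count_on")
dfOnCounts.sort("Features").show(false)
// +--------+--------+
// |Features|Count_on|
// +--------+--------+
// |A       |2       |
// |B       |1       |
// +--------+--------+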

Upvotes: 1
