Reputation: 391
I have the following type of dataset
A | B |
---|---|
on | off |
off | on |
on | nc |
nc | off |
I have shown only two columns but there are many A , B, C and so on..
Possible values of all these columns are on, off and nc.
I want to understand how to get the count of on for each of the column in a dataframe.
I want the output in this type of dataframe.
Features | Count_on |
---|---|
A | 3 |
B | 4 |
C | 6 |
and so on...
I tried to do the count of each of the column with on using following code.
val A_count_on = df_original.groupBy("A").count().filter("A=='on'").collect()
val A_value_on = A_count_on(0).getLong(1)
This is how I get value of on only in A column. Running the same code on multiple columns I will get the count of the rest of the columns where that column value is on.
However, I dont understand how I generate the result dataframe, where I get the column name of the original dataframe as the row data with the count in the second column.
Please help if there is any easy way to do this.
Upvotes: 1
Views: 1203
Reputation: 4059
You can first use the stack
function for all the desired columns and apply a pivot to your table:
val df = sc.parallelize(Seq(
("on", "off"),
("off", "on"),
("on", "nc"),
("nc", "off")
)).toDF("A", "B")
val dfStack = df.selectExpr("""stack(2, "A", A, "B", B) as (features, value)""")
// +--------+-----+
// |features|value|
// +--------+-----+
// |A |on |
// |B |off |
// |A |off |
// |B |on |
// |A |on |
// |B |nc |
// |A |nc |
// |B |off |
// +--------+-----+
val dfGroup = dfStack.groupBy("features").pivot("value", Seq("on", "off", "nc")).count()
dfGroup.sort("features").show(false)
Output
+--------+---+---+---+
|features|on |off|nc |
+--------+---+---+---+
|A |2 |1 |1 |
|B |1 |2 |1 |
+--------+---+---+---+
Upvotes: 1