Reputation: 1108
As a part of big task I am facing some issues when I reach to find the count of records in each column grouping by another column. I am not much experienced in playing around with dataframe columns.
I am having a spark dataframe as below.
+---+------------+--------+--------+--------+
|id | date|signal01|signal02|signal03|
+---+------------+--------+--------+--------+
|050|2021-01-14 |1 |3 |0 |
|050|2021-01-15 |1 |3 |0 |
|050|2021-02-02 |1 |3 |0 |
|051|2021-01-14 |1 |3 |0 |
|051|2021-01-15 |1 |3 |0 |
|051|2021-02-02 |1 |3 |0 |
|051|2021-02-03 |1 |3 |0 |
|052|2021-03-03 |1 |3 |0 |
|052|2021-03-05 |1 |3 |0 |
|052|2021-03-06 |1 |3 |0 |
|052|2021-03-16 |1 |3 |0 |
I am working in scala language to make use of this data frame and trying to get result as shown below.
+---+--------+--------+--------+
|id |signal01|signal02|signal03|
+---+--------+--------+--------+
|050|3 |3 |3 |
|051|4 |4 |4 |
|052|4 |4 |4 |
I have tried something below, getting total count ,but not grouped with Id which was not expected.
val signalColumns = ((Temp01DF.columns.toBuffer) -= ("id","date"))
val Temp02DF = Temp01DF.select(signalColumns.map(c => count(col(c)).alias(c)): _*).show()
+--------+--------+--------+
|signal01|signal02|signal03|
+--------+--------+--------+
|51 |51 |51 |
Is there any ways to achieve this in scala lang?
Upvotes: 0
Views: 1790
Reputation: 11
For the first part, I found that the required result can be achieved this way:
val signalCount = df.groupBy("id")
.agg(count("signal01"), count("signal02"), count("signal03"))
Make sure you have the spark functions imported:
import org.apache.spark.sql.functions._
Upvotes: 0
Reputation: 2003
A PySpark Solution
df = spark.createDataFrame([(50, 1, 3, 0),(50, 1, 3, 0), (50, 1, 3, 0), (51, 1, 3, 0), (51, 1, 3, 0), (51, 1, 3, 0), (51, 1, 3, 0), (52, 1, 3, 0),(52, 1, 3, 0), (52, 1, 3, 0), (52, 1, 3, 0)],[ "col1","col2", "col3", "col4"])
df.show()
df_grp = df.groupBy("col1").agg(F.count("col2").alias("col2"), F.count("col3").alias("col3"), F.count("col4").alias("col4"))
df_grp.show()
Output
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
| 50| 3| 3| 3|
| 51| 4| 4| 4|
| 52| 4| 4| 4|
+----+----+----+----+
Upvotes: 0
Reputation: 6099
You are probably looking for groupBy
, agg
and count
.
You can do something like this:
// define some data
val df = Seq(
("050", 1, 3, 0),
("050", 1, 3, 0),
("050", 1, 3, 0),
("051", 1, 3, 0),
("051", 1, 3, 0),
("051", 1, 3, 0),
("051", 1, 3, 0),
("052", 1, 3, 0),
("052", 1, 3, 0),
("052", 1, 3, 0),
("052", 1, 3, 0)
).toDF("id", "signal01", "signal02", "signal03")
val countColumns = Seq("signal01", "signal02", "signal03").map(c => count("*").as(c))
df.groupBy("id").agg(countColumns.head, countColumns.tail: _*).show
/*
+---+--------+--------+--------+
| id|signal01|signal02|signal03|
+---+--------+--------+--------+
|052| 4| 4| 4|
|051| 4| 4| 4|
|050| 3| 3| 3|
+---+--------+--------+--------+
*/
Instead of counting "*"
, you can have a predicate:
val countColumns = Seq("signal01", "signal02", "signal03").map(c => count(when(col(c) > 0, 1)).as(c))
df.groupBy("id").agg(countColumns.head, countColumns.tail: _*).show
/*
+---+--------+--------+--------+
| id|signal01|signal02|signal03|
+---+--------+--------+--------+
|052| 4| 4| 0|
|051| 4| 4| 0|
|050| 3| 3| 0|
+---+--------+--------+--------+
*/
Upvotes: 1