Antony

Reputation: 1108

groupBy and get count of records for multiple columns in Scala

As part of a bigger task, I am stuck at finding the count of records in each column, grouped by another column. I am not very experienced with manipulating dataframe columns.

I have a Spark dataframe as below.

+---+------------+--------+--------+--------+
|id |        date|signal01|signal02|signal03|
+---+------------+--------+--------+--------+
|050|2021-01-14  |1       |3       |0       |
|050|2021-01-15  |1       |3       |0       |
|050|2021-02-02  |1       |3       |0       |
|051|2021-01-14  |1       |3       |0       |
|051|2021-01-15  |1       |3       |0       |
|051|2021-02-02  |1       |3       |0       |
|051|2021-02-03  |1       |3       |0       |
|052|2021-03-03  |1       |3       |0       |
|052|2021-03-05  |1       |3       |0       |
|052|2021-03-06  |1       |3       |0       |
|052|2021-03-16  |1       |3       |0       |
+---+------------+--------+--------+--------+

I am working in Scala and trying to get the result shown below.

+---+--------+--------+--------+
|id |signal01|signal02|signal03|
+---+--------+--------+--------+
|050|3       |3       |3       |
|051|4       |4       |4       |
|052|4       |4       |4       |
+---+--------+--------+--------+

  1. For each id, the count of each signal should be the output.
  2. Also, is there a way to pass a condition when getting the count, such as the count of signals with value > 0?

I have tried the following. It gives the total count, but not grouped by id, which is not what I expected.

val signalColumns = Temp01DF.columns.toBuffer -= ("id", "date")

val Temp02DF = Temp01DF.select(signalColumns.map(c => count(col(c)).alias(c)): _*)
Temp02DF.show()

+--------+--------+--------+
|signal01|signal02|signal03|
+--------+--------+--------+
|51      |51      |51      |
+--------+--------+--------+

Is there any way to achieve this in Scala?

Upvotes: 0

Views: 1790

Answers (3)

Tai

Reputation: 11

For the first part, I found that the required result can be achieved this way:

val signalCount = df.groupBy("id")
  .agg(count("signal01"), count("signal02"), count("signal03"))

Make sure you have the Spark SQL functions imported:

import org.apache.spark.sql.functions._
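
If you do not want to list every signal column by hand, the same idea can be built dynamically. A minimal sketch, assuming every column other than id and date is a signal column:

// build one count aggregation per signal column
val signalCols = df.columns.filterNot(Set("id", "date"))
val aggExprs = signalCols.map(c => count(col(c)).alias(c))

val signalCount = df.groupBy("id").agg(aggExprs.head, aggExprs.tail: _*)
signalCount.show()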

Upvotes: 0

dsk

Reputation: 2003

A PySpark Solution

from pyspark.sql import functions as F

df = spark.createDataFrame(
    [(50, 1, 3, 0), (50, 1, 3, 0), (50, 1, 3, 0),
     (51, 1, 3, 0), (51, 1, 3, 0), (51, 1, 3, 0), (51, 1, 3, 0),
     (52, 1, 3, 0), (52, 1, 3, 0), (52, 1, 3, 0), (52, 1, 3, 0)],
    ["col1", "col2", "col3", "col4"])
df.show()

df_grp = df.groupBy("col1").agg(
    F.count("col2").alias("col2"),
    F.count("col3").alias("col3"),
    F.count("col4").alias("col4"))
df_grp.show()

Output

+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
|  50|   3|   3|   3|
|  51|   4|   4|   4|
|  52|   4|   4|   4|
+----+----+----+----+

Upvotes: 0

BlueSheepToken

Reputation: 6099

You are probably looking for groupBy, agg and count.

You can do something like this:

// imports needed for toDF and the aggregate functions
import org.apache.spark.sql.functions._
import spark.implicits._

// define some data
val df = Seq(
  ("050", 1, 3, 0),
  ("050", 1, 3, 0),
  ("050", 1, 3, 0),
  ("051", 1, 3, 0),
  ("051", 1, 3, 0),
  ("051", 1, 3, 0),
  ("051", 1, 3, 0),
  ("052", 1, 3, 0),
  ("052", 1, 3, 0),
  ("052", 1, 3, 0),
  ("052", 1, 3, 0)
).toDF("id", "signal01", "signal02", "signal03")

val countColumns = Seq("signal01", "signal02", "signal03").map(c => count("*").as(c))

df.groupBy("id").agg(countColumns.head, countColumns.tail: _*).show
/*
+---+--------+--------+--------+
| id|signal01|signal02|signal03|
+---+--------+--------+--------+
|052|       4|       4|       4|
|051|       4|       4|       4|
|050|       3|       3|       3|
+---+--------+--------+--------+
*/
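
Note that count("*") counts every row in the group, nulls included. If the signal columns can contain nulls and you want per-column counts of non-null records, a sketch using the column itself instead of "*":

val countColumns = Seq("signal01", "signal02", "signal03").map(c => count(col(c)).as(c))

df.groupBy("id").agg(countColumns.head, countColumns.tail: _*).show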

Instead of counting "*", you can have a predicate:

val countColumns = Seq("signal01", "signal02", "signal03").map(c => count(when(col(c) > 0, 1)).as(c))

df.groupBy("id").agg(countColumns.head, countColumns.tail: _*).show
/*
+---+--------+--------+--------+
| id|signal01|signal02|signal03|
+---+--------+--------+--------+
|052|       4|       4|       0|
|051|       4|       4|       0|
|050|       3|       3|       0|
+---+--------+--------+--------+
*/
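
The same conditional count can also be written by summing a 0/1 flag; a sketch, equivalent to the when/count version above for these column names:

val countColumns = Seq("signal01", "signal02", "signal03")
  .map(c => sum(when(col(c) > 0, 1).otherwise(0)).as(c))

df.groupBy("id").agg(countColumns.head, countColumns.tail: _*).show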

Upvotes: 1
