Reputation: 25
I have this dataframe that gets generated automatically and the names and numbers of columns will never be known. I would like to know how I can count the occurrence of each of the values in each of the columns.
For example,
Col1 Col2 Col3
Row1 True False False
Row2 True True True
Row3 False False True
Row4 False False False
The result should be something like:
Col1 Count Col2 Count Col3 Count
True 2 True 1 True 2
False 2 False 3 False 2
I have tried applying GroupBy kind of like this:
df.groupBy(record => (record.Col1, record.Col2, record.Col3)).count().show
But this wouldn't work for me since I wouldn't know the number or names of the columns.
Upvotes: 0
Views: 718
Reputation: 6323
Try this-
val data =
"""
|Col1 Col2 Col3
|True False False
|True True True
|False False True
|False False False
""".stripMargin
val stringDS2 = data.split(System.lineSeparator())
.map(_.split("\\s+").map(_.replaceAll("""^[ \t]+|[ \t]+$""", "")).mkString("|"))
.toSeq.toDS()
val df2 = spark.read
.option("sep", "|")
.option("inferSchema", "true")
.option("header", "true")
.option("nullValue", "null")
.csv(stringDS2)
df2.show(false)
df2.printSchema()
/**
* +-----+-----+-----+
* |Col1 |Col2 |Col3 |
* +-----+-----+-----+
* |true |false|false|
* |true |true |true |
* |false|false|true |
* |false|false|false|
* +-----+-----+-----+
*
* root
* |-- Col1: boolean (nullable = true)
* |-- Col2: boolean (nullable = true)
* |-- Col3: boolean (nullable = true)
*/
val findCounts = df2.columns.flatMap(c => Seq(col(c), count(c).over(Window.partitionBy(c)).as(s"count_$c")))
df2.select(findCounts: _*).distinct()
.show(false)
/**
* +-----+----------+-----+----------+-----+----------+
* |Col1 |count_Col1|Col2 |count_Col2|Col3 |count_Col3|
* +-----+----------+-----+----------+-----+----------+
* |false|2 |false|3 |false|2 |
* |false|2 |false|3 |true |2 |
* |true |2 |false|3 |false|2 |
* |true |2 |true |1 |true |2 |
* +-----+----------+-----+----------+-----+----------+
*/
Assuming all the columns in the dataframe have same distinct values
// Assuming all the columns in the dataframe have same distinct values
val columns = df2.columns
val head = columns.head
val zeroDF = df2.groupBy(head).agg(count(head).as(s"${head}_count"))
columns.tail.foldLeft(zeroDF){
(df, c) => df.join(df2.groupBy(c).agg(count(c).as(s"${c}_count")), col(head) === col(c))
}.show(false)
/**
* +-----+----------+-----+----------+-----+----------+
* |Col1 |Col1_count|Col2 |Col2_count|Col3 |Col3_count|
* +-----+----------+-----+----------+-----+----------+
* |false|2 |false|3 |false|2 |
* |true |2 |true |1 |true |2 |
* +-----+----------+-----+----------+-----+----------+
*/
Upvotes: 1