shrewdbeaver

Reputation: 25

How to count occurrences of different values in multiple columns all at once, when the number and names of the columns are not known?

I have a dataframe that gets generated automatically, so the names and number of its columns will never be known in advance. I would like to know how I can count the occurrences of each value in each of the columns.

For example,

        Col1  Col2   Col3
Row1    True  False  False
Row2    True  True   True
Row3    False False  True
Row4    False False  False

The result should be something like:

Col1 Count Col2 Count Col3 Count
True   2   True    1  True   2
False  2   False   3  False  2

I have tried applying groupBy, something like this:

df.groupBy(record => (record.Col1, record.Col2, record.Col3)).count().show

But this wouldn't work for me since I wouldn't know the number or names of the columns.

Upvotes: 0

Views: 718

Answers (1)

Som

Reputation: 6323

Try this:

Load the test data provided:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._
    import spark.implicits._

    // Parse the sample table into a Dataset[String] and read it back as a CSV with an inferred schema
    val data =
      """
        |Col1  Col2   Col3
        |True  False  False
        |True  True   True
        |False False  True
        |False False  False
      """.stripMargin
    val stringDS2 = data.split(System.lineSeparator())
      .map(_.split("\\s+").map(_.replaceAll("""^[ \t]+|[ \t]+$""", "")).mkString("|"))
      .toSeq.toDS()
    val df2 = spark.read
      .option("sep", "|")
      .option("inferSchema", "true")
      .option("header", "true")
      .option("nullValue", "null")
      .csv(stringDS2)
    df2.show(false)
    df2.printSchema()
    /**
      * +-----+-----+-----+
      * |Col1 |Col2 |Col3 |
      * +-----+-----+-----+
      * |true |false|false|
      * |true |true |true |
      * |false|false|true |
      * |false|false|false|
      * +-----+-----+-----+
      *
      * root
      * |-- Col1: boolean (nullable = true)
      * |-- Col2: boolean (nullable = true)
      * |-- Col3: boolean (nullable = true)
      */
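
For reference, the same test dataframe can also be built directly from a Seq of tuples instead of parsing a string (a minimal sketch, assuming spark.implicits._ is in scope as above; df2Direct is just an illustrative name):

    // Build the same test data directly; toDF names the columns
    val df2Direct = Seq(
      (true,  false, false),
      (true,  true,  true),
      (false, false, true),
      (false, false, false)
    ).toDF("Col1", "Col2", "Col3")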

A simple way to compute the count of each distinct value in each column:

    // For each column, select the value alongside its count over a window partitioned by that column,
    // then drop duplicate rows so each combination appears once
    val findCounts = df2.columns.flatMap(c =>
      Seq(col(c), count(c).over(Window.partitionBy(c)).as(s"count_$c")))
    df2.select(findCounts: _*).distinct()
      .show(false)
    /**
      * +-----+----------+-----+----------+-----+----------+
      * |Col1 |count_Col1|Col2 |count_Col2|Col3 |count_Col3|
      * +-----+----------+-----+----------+-----+----------+
      * |false|2         |false|3         |false|2         |
      * |false|2         |false|3         |true |2         |
      * |true |2         |false|3         |false|2         |
      * |true |2         |true |1         |true |2         |
      * +-----+----------+-----+----------+-----+----------+
      */
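
The same per-column counts can also be obtained without a window, by running a plain groupBy per column (a minimal sketch; it prints one small count table per column rather than a single combined result):

    // One groupBy/count per column; each column gets its own result table
    df2.columns.foreach { c =>
      df2.groupBy(col(c)).agg(count(c).as(s"count_$c")).show(false)
    }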

If you need the output in the same format as shown in the question, try this.

Assuming all the columns in the dataframe have the same set of distinct values:

    // Assuming all the columns in the dataframe have the same set of distinct values:
    // compute counts per value for the first column, then join in the counts of each remaining column
    val columns = df2.columns
    val head = columns.head
    val zeroDF = df2.groupBy(head).agg(count(head).as(s"${head}_count"))
    columns.tail.foldLeft(zeroDF) { (acc, c) =>
      acc.join(df2.groupBy(c).agg(count(c).as(s"${c}_count")), col(head) === col(c))
    }.show(false)

    /**
      * +-----+----------+-----+----------+-----+----------+
      * |Col1 |Col1_count|Col2 |Col2_count|Col3 |Col3_count|
      * +-----+----------+-----+----------+-----+----------+
      * |false|2         |false|3         |false|2         |
      * |true |2         |true |1         |true |2         |
      * +-----+----------+-----+----------+-----+----------+
      */
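
Since the column names are not known up front, the same fold can be wrapped in a small helper and applied to any dataframe (a sketch with a hypothetical name countsPerColumn; it keeps the assumption that every column shares the same set of distinct values):

    import org.apache.spark.sql.DataFrame

    // Hypothetical helper: per-column value counts joined side by side,
    // assuming every column has the same set of distinct values
    def countsPerColumn(df: DataFrame): DataFrame = {
      val columns = df.columns
      val head = columns.head
      val zeroDF = df.groupBy(head).agg(count(head).as(s"${head}_count"))
      columns.tail.foldLeft(zeroDF) { (acc, c) =>
        acc.join(df.groupBy(c).agg(count(c).as(s"${c}_count")), col(head) === col(c))
      }
    }

    countsPerColumn(df2).show(false)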

Upvotes: 1
