John

Reputation: 4006

How to get counts for null, not null, distinct values, and all rows for all columns in a SQL table?

I'm looking to build a summary table that shows, for every column of a given table, the count of null values, not-null values, distinct values, and all rows. This happens to be in Databricks (Apache Spark).

Something that looks like what is shown below.

[desired output: one row per source column, with col_position, col_name, all_records, not_null, null, and distinct_values]

I know I can do this with something like the SQL shown below, and I could use something like Java or Python to generate that SQL.

The question is: is there a better or more efficient way to do this than the query below?

select 
  1 col_position,
  'location_id' col_name,
  count(*) all_records,
  count(location_id) not_null,
count(*) - count(location_id) `null`,
  count(distinct location_id) distinct_values
from
  admin
  
union
select 
  2 col_position,
  'location_zip' col_name,
  count(*) all_records,
  count(location_zip) not_null,
count(*) - count(location_zip) `null`,
  count(distinct location_zip) distinct_values
from
  admin
  
union
select 
  3 col_position,
  'provider_npi' col_name,
  count(*) all_records,
  count(provider_npi) not_null,
count(*) - count(provider_npi) `null`,
  count(distinct provider_npi) distinct_values
from
  admin
  
order by col_position
;

Upvotes: 0

Views: 3031

Answers (2)

Use count(field) total_count; count on a column already skips nulls, so that counts all the non-null rows. (By contrast, count(ifnull(field, 0)) counts every row, because the nulls are replaced by 0 before counting.)
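
A quick way to check that behavior in a notebook (a throwaway sketch; the demo DataFrame and column names are just for illustration):

import org.apache.spark.sql.functions._
import spark.implicits._

// three rows, one null in "field"
val demo = Seq((1, Some("a")), (2, None), (3, Some("c"))).toDF("id", "field")

demo.select(
  count(lit(1)).as("all_rows"),        // 3: count(*) semantics, counts every row
  count($"field").as("not_null_rows")  // 2: count(column) skips nulls
).show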

Upvotes: 0

blackbishop

Reputation: 32640

As said in the comments, using UNION ALL should be efficient.

Using SQL

To avoid typing out all the columns and subqueries, you can generate the SQL query from the list of columns like this:

val df = spark.sql("select * from admin")

// generate the same query from the columns list
val sqlQuery =
  df.columns.zipWithIndex.map { case (c, i) =>
    Seq(
      s"${i + 1} col_position",        // zipWithIndex is 0-based, so shift to match the handwritten query
      s"'$c' col_name",                // quote it so the name is a string literal, not a column reference
      "count(*) all_records",
      s"count($c) not_null",
      s"count(*) - count($c) `null`",  // backquote the alias: null is a keyword
      s"count(distinct $c) distinct_values"
    ).mkString("select ", ", ", " from admin")
  }.mkString("", " union all\n", "\norder by col_position")

spark.sql(sqlQuery).show
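
Assuming admin has just the three columns from the question, the generated sqlQuery comes out as (abbreviated):

select 1 col_position, 'location_id' col_name, count(*) all_records, count(location_id) not_null, count(*) - count(location_id) `null`, count(distinct location_id) distinct_values from admin union all
select 2 col_position, 'location_zip' col_name, ... from admin union all
select 3 col_position, 'provider_npi' col_name, ... from admin
order by col_position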

Using DataFrame (Scala)

There are some optimizations you can make by using the DataFrame API, like computing count(*) only once, avoiding typing out all the column names, and the option of caching the input.
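
The caching, for instance, can be as simple as the line below (a sketch; it only pays off because the input gets scanned once per metric):

df.cache() // lazy: the first action materializes the cache, later metric scans reuse it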

Example input DataFrame:

//+---+---------+--------+---------------------+------+
//|id |firstName|lastName|email                |salary|
//+---+---------+--------+---------------------+------+
//|1  |michael  |armbrust|[email protected]|100K  |
//|2  |xiangrui |meng    |[email protected]|120K  |
//|3  |matei    |null    |[email protected]|140K  |
//|4  |null     |wendell |null                 |160K  |
//|5  |michael  |jackson |[email protected]  |null  |
//+---+---------+--------+---------------------+------+

First, get the row count and the column list:

val cols = df.columns
val allRecords = df.count

Then, calculate each metric by looping over the column list (you could also factor this into a function per metric; a sketch follows the code below):

import org.apache.spark.sql.functions._

val nullsCountDF = df.select(
  (Seq(expr("'nulls' as metric")) ++ cols.map(c =>
    sum(when(col(c).isNull, lit(1)).otherwise(lit(0))).as(c) // 1 per null value, summed per column
  )): _*
)
val distinctCountDF = df.select(
  (Seq(expr("'distinct_values' as metric")) ++ cols.map(c =>
    countDistinct(c).as(c)
  )): _*
)
val maxDF = df.select(
  (Seq(expr("'max_value' as metric")) ++ cols.map(c => max(c).as(c))): _*
)
val minDF = df.select(
  (Seq(expr("'min_value' as metric")) ++ cols.map(c => min(c).as(c))): _*
)
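
The repeated select pattern above is a good candidate for the per-metric function mentioned earlier. A minimal sketch (metricFor is a hypothetical name, not part of the original answer):

import org.apache.spark.sql.{Column, DataFrame}

// builds one metric row: a metric-name literal followed by one aggregate per column
def metricFor(name: String)(agg: String => Column): DataFrame =
  df.select((Seq(expr(s"'$name' as metric")) ++ cols.map(c => agg(c).as(c))): _*)

// the max/min rows above then become:
val maxValuesDF = metricFor("max_value")(c => max(col(c)))
val minValuesDF = metricFor("min_value")(c => min(col(c)))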

val allRecordsDF = spark.sql(
  "select 'all_records' as metric, " +
    cols.map(c => s"$allRecords as $c").mkString(", ")
)
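
If you would rather not build a SQL string, the same one-row DataFrame can be produced with literals (a sketch; spark.range(1) just supplies the single row to project onto):

// alternative to the spark.sql version above
val allRecordsDF = spark.range(1).select(
  (Seq(expr("'all_records' as metric")) ++ cols.map(c => lit(allRecords).as(c))): _*
)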

Finally, union the DataFrames created above:

val metricsDF = Seq(allRecordsDF, nullsCountDF, distinctCountDF, maxDF, minDF).reduce(_ union _)

metricsDF.show

//+---------------+---+---------+--------+---------------------+------+
//|metric         |id |firstName|lastName|email                |salary|
//+---------------+---+---------+--------+---------------------+------+
//|all_records    |5  |5        |5       |5                    |5     |
//|nulls          |0  |1        |1       |1                    |1     |
//|distinct_values|5  |3        |4       |4                    |4     |
//|max_value      |5  |xiangrui |wendell |[email protected]|160K  |
//|min_value      |1  |matei    |armbrust|[email protected]|100K  |
//+---------------+---+---------+--------+---------------------+------+

Using DataFrame (Python)

For a Python example, see my other answer.

Upvotes: 1
