Reputation: 4006
I'm currently looking to produce a table that shows, for every column in a given table, the count of nulls, non-nulls, distinct values, and all rows. This happens to be in Databricks (Apache Spark).
Something that looks like what is shown below.
I know I can do this with something like the SQL shown below. Also, I can use something like Java or Python, etc., to generate the SQL.
The question is:
Is this the most efficient approach?
Is there a better way to write this query (less typing and/or more efficient)?
select
    1 col_position,
    'location_id' col_name,
    count(*) all_records,
    count(location_id) not_null,
    count(*) - count(location_id) null,
    count(distinct location_id) distinct_values
from
    admin
union
select
    2 col_position,
    'location_zip' col_name,
    count(*) all_records,
    count(location_zip) not_null,
    count(*) - count(location_zip) null,
    count(distinct location_zip) distinct_values
from
    admin
union
select
    3 col_position,
    'provider_npi' col_name,
    count(*) all_records,
    count(provider_npi) not_null,
    count(*) - count(provider_npi) null,
    count(distinct provider_npi) distinct_values
from
    admin
order by col_position
;
Upvotes: 0
Views: 3031
Reputation: 4233
Use count(ifnull(field, 0)) total_count. Since ifnull replaces nulls with 0 before counting, this counts every row (the total row count), while count(field) counts only the non-null rows.
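A minimal sketch of the difference, using the admin table and location_id column from the question:

// count(col) skips nulls; count(ifnull(col, 0)) counts every row,
// because nulls are replaced by 0 before counting
spark.sql("""
  select
    count(location_id)            as not_null,
    count(ifnull(location_id, 0)) as total_count,
    count(*)                      as all_records
  from admin
""").show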
Upvotes: 0
Reputation: 32640
As said in the comments, using UNION ALL should be efficient. To avoid typing out all the columns and subqueries by hand, you can generate the SQL query from the list of columns, like this:
val df = spark.sql("select * from admin")

// generate the same query from the columns list:
// one sub-select per column, combined with union all
val sqlQuery = df.columns.zipWithIndex.map { case (c, i) =>
  Seq(
    s"${i + 1} col_position",
    s"'$c' col_name",
    "count(*) all_records",
    s"count($c) not_null",
    s"count(*) - count($c) null",
    s"count(distinct $c) distinct_values"
  ).mkString("select ", ", ", " from admin")
}.mkString("", "\nunion all\n", "\norder by col_position")

spark.sql(sqlQuery).show
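If you want to check what was generated before running it, you can print the statement:

// inspect the generated SQL statement
println(sqlQuery)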
There are some optimizations you can make by using the DataFrame API instead, such as computing count(*) only once, avoiding typing all the column names, and the possibility of caching.
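For the caching part, a minimal sketch (assuming the admin table fits comfortably in memory), so that each metric pass reuses the in-memory data instead of re-reading the table:

// cache the source table once; the metric computations below reuse the cached data
val df = spark.table("admin").cache()
df.count()  // an action, to materialize the cache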
Example input DataFrame:
//+---+---------+--------+---------------------+------+
//|id |firstName|lastName|email |salary|
//+---+---------+--------+---------------------+------+
//|1 |michael |armbrust|[email protected]|100K |
//|2 |xiangrui |meng |[email protected]|120K |
//|3 |matei |null |[email protected]|140K |
//|4 |null |wendell |null |160K |
//|5 |michael |jackson |[email protected] |null |
//+---+---------+--------+---------------------+------+
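If you want to reproduce this input locally, a minimal sketch (the email addresses below are placeholders, since the original values are redacted):

import spark.implicits._

// sample data matching the table above; None produces the null cells
val df = Seq(
  (1, Option("michael"),  Option("armbrust"), Option("michael@example.com"),   Option("100K")),
  (2, Option("xiangrui"), Option("meng"),     Option("xiangrui@example.com"),  Option("120K")),
  (3, Option("matei"),    None,               Option("matei@example.com"),     Option("140K")),
  (4, None,               Option("wendell"),  None,                            Option("160K")),
  (5, Option("michael"),  Option("jackson"),  Option("michael.j@example.com"), None)
).toDF("id", "firstName", "lastName", "email", "salary")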
First, get the row count and the column list:
val cols = df.columns
val allRecords = df.count
Then, calculate each metric by looping over the columns list (you could, for example, write a small function per metric; see the sketch after this code):
import org.apache.spark.sql.functions._

// count of null values per column
val nullsCountDF = df.select(
  (Seq(expr("'nulls' as metric")) ++ cols.map(c =>
    sum(when(col(c).isNull, lit(1)).otherwise(lit(0))).as(c)
  )): _*
)

// count of distinct values per column
val distinctCountDF = df.select(
  (Seq(expr("'distinct_values' as metric")) ++ cols.map(c =>
    countDistinct(c).as(c)
  )): _*
)

// max and min value per column
val maxDF = df.select(
  (Seq(expr("'max_value' as metric")) ++ cols.map(c => max(c).as(c))): _*
)
val minDF = df.select(
  (Seq(expr("'min_value' as metric")) ++ cols.map(c => min(c).as(c))): _*
)

// total row count repeated for every column (computed only once, above)
val allRecordsDF = spark.sql("select 'all_records' as metric," + cols.map(c => s"$allRecords as $c").mkString(","))
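As a sketch of the "function per metric" idea mentioned above (the helper name metricDF is made up here), the repeated select-and-prepend pattern can be factored out:

import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions._

// hypothetical helper: applies agg to every column and prepends a literal metric label
def metricDF(src: DataFrame, name: String)(agg: String => Column): DataFrame =
  src.select((Seq(lit(name).as("metric")) ++ src.columns.map(c => agg(c).as(c))): _*)

// equivalent to the DataFrames defined above
val nullsCountDF    = metricDF(df, "nulls")(c => sum(when(col(c).isNull, 1).otherwise(0)))
val distinctCountDF = metricDF(df, "distinct_values")(c => countDistinct(c))
val maxDF           = metricDF(df, "max_value")(c => max(c))
val minDF           = metricDF(df, "min_value")(c => min(c))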
Finally, union the DataFrames created above:
val metricsDF = Seq(allRecordsDF, nullsCountDF, distinctCountDF, maxDF, minDF).reduce(_ union _)
metricsDF.show
//+---------------+---+---------+--------+---------------------+------+
//|metric |id |firstName|lastName|email |salary|
//+---------------+---+---------+--------+---------------------+------+
//|all_records |5 |5 |5 |5 |5 |
//|nulls |0 |1 |1 |1 |1 |
//|distinct_values|5 |3 |4 |4 |4 |
//|max_value |5 |xiangrui |wendell |[email protected]|160K |
//|min_value |1 |matei |armbrust|[email protected]|100K |
//+---------------+---+---------+--------+---------------------+------+
For a Python example, you can see my other answer.
Upvotes: 1