vvazza

Reputation: 397

Check null count of a column from a DataFrame and add column and count as Map

I am a Scala beginner. I am trying to find the count of null values in a column of a table and add the column name and count as a key-value pair in a Map. The code below doesn't work as expected. Please guide me on how I can modify it to make it work.

def nullCheck(databaseName: String, tableName: String): scala.collection.mutable.Map[String, Long] = {
  val map = scala.collection.mutable.Map[String, Long]()
  val validationColumn = Array("col1", "col2")           // column names must be strings
  for (i <- 0 until validationColumn.length) {           // `until`, not `to`, to avoid going out of bounds
    // interpolate the column name; count(*) returns a Row, so extract the Long
    val nullVal = spark.sql(
      s"select count(*) from $databaseName.$tableName where ${validationColumn(i)} is NULL"
    ).first.getLong(0)
    map(validationColumn(i)) = nullVal
  }
  map                                                    // return the map after the loop
}

The function should return Map(col1 -> count, col2 -> count).
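For comparison, the same result can be computed in a single pass over the table with the DataFrame API instead of one query per column. This is a sketch, assuming a running SparkSession named `spark` and that `col1`/`col2` are the validation columns:

```scala
import org.apache.spark.sql.functions.{col, count, when}

def nullCheck(databaseName: String, tableName: String): Map[String, Long] = {
  val validationColumns = Seq("col1", "col2")
  val df = spark.table(s"$databaseName.$tableName")
  // one aggregate per column: count only the rows where that column is null
  val aggs = validationColumns.map(c => count(when(col(c).isNull, 1)).as(c))
  val row = df.agg(aggs.head, aggs.tail: _*).first()
  validationColumns.zipWithIndex.map { case (c, i) => c -> row.getLong(i) }.toMap
}
```

Because all the counts come from one `agg`, the table is scanned once regardless of how many columns are validated.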

Upvotes: 0

Views: 342

Answers (1)

SanBan

Reputation: 655

This can be done by building a dynamic SQL string and then mapping the result. Your approach reads the same data multiple times.

Here is the solution. I used an "example" DataFrame.

scala> val inputDf = Seq((Some("Sam"),None,200),(None,Some(31),30),(Some("John"),Some(25),25),(Some("Harry"),None,100)).toDF("name","age","not_imp_column")

scala> inputDf.show(false)
+-----+----+--------------+
|name |age |not_imp_column|
+-----+----+--------------+
|Sam  |null|200           |
|null |31  |30            |
|John |25  |25            |
|Harry|null|100           |
+-----+----+--------------+

Our validation columns are name and age, where we shall count nulls; we put them in a List:

scala> val validationColumns = List("name","age")

Then we create a SQL string that will drive the whole calculation:

scala> val sqlStr = "select " +
  validationColumns.map(x => "sum(" + x + "_count) AS " + x + "_sum").mkString(",") +
  " from (select " +
  validationColumns.map(x => "case when " + x + " = '$$' then 1 else 0 end AS " + x + "_count").mkString(",") +
  " from (select" +
  validationColumns.map(x => " nvl( " + x + ",'$$') as " + x).mkString(",") +
  " from example_table where " +
  validationColumns.map(x => x + " is null ").mkString("or ") +
  " ) layer1 ) layer2 "

It resolves to:

"select sum(name_count) AS name_sum,sum(age_count) AS age_sum from (select case when name = '$$' then 1 else 0 end AS name_count,case when age = '$$' then 1 else 0 end AS age_count from (select nvl( name,'$$') as name, nvl( age,'$$') as age from example_table where name is null or age is null ) layer1 ) layer2 "

Now we create a temporary view of our DataFrame:

inputDf.createOrReplaceTempView("example_table")

The only thing left to do is execute the SQL and create the Map, which is done by:

(validationColumns zip spark.sql(sqlStr).collect.map(_.toSeq).flatten.toList).toMap

And the result:

Map(name -> 1, age -> 2) // obviously you can make it type safe
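One way to make it type safe, as alluded to above, is to pull the Long counts out of the result Row explicitly rather than flattening it to a list of Any. A sketch, reusing the same `sqlStr` and `validationColumns`:

```scala
// the query returns a single row of sums, one column per validation column
val row = spark.sql(sqlStr).first()
val nullCounts: Map[String, Long] =
  validationColumns.zipWithIndex.map { case (c, i) => c -> row.getLong(i) }.toMap
```

This yields a `Map[String, Long]` instead of the untyped `Map` produced by `collect.map(_.toSeq).flatten`.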

Upvotes: 1
