Reputation: 397
I am a Scala beginner. I am trying to find the count of null values in the columns of a table and add each column name and count as a key-value pair in a Map. The code below doesn't work as expected. Please guide me on how to modify it so it works.
def nullCheck(databaseName: String, tableName: String) = {
  var map = scala.collection.mutable.Map[String, Int]()
  validationColumn = Array(col1, col2)
  for (i <- 0 to validationColumn.length) {
    val nullVal = spark.sql(s"select count(*) from $databaseName.$tableName where validationColumn(i) is NULL")
    if (nullval == 0)
      map(validationColumn(i)) = nullVal
    map
  }
}
The function should return the column names and null counts as a Map, e.g. Map(col1 -> count, col2 -> count).
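For reference, here is a minimal corrected sketch of what the loop-based function is presumably intended to do (assuming a SparkSession is in scope as spark; the column names are placeholders). Note that it still runs one query per column:

def nullCheck(databaseName: String, tableName: String): Map[String, Long] = {
  // Fixes vs. the code above: declare validationColumn with val, quote the
  // column names, iterate over the elements directly (0 to length overshoots
  // by one), interpolate the column into the SQL, read the count out of the
  // returned DataFrame, and return the map after the loop instead of inside it.
  val validationColumn = Array("col1", "col2")
  val map = scala.collection.mutable.Map[String, Long]()
  for (c <- validationColumn) {
    // count(*) comes back as a bigint, hence getLong
    val nullVal = spark.sql(
      s"select count(*) from $databaseName.$tableName where $c is NULL"
    ).head.getLong(0)
    map(c) = nullVal
  }
  map.toMap
}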
Upvotes: 0
Views: 342
Reputation: 655
This can be done by building a dynamic SQL string and then mapping the result; your approach reads the same data multiple times. Here is the solution, using an example DataFrame:
scala> val inputDf = Seq((Some("Sam"),None,200),(None,Some(31),30),(Some("John"),Some(25),25),(Some("Harry"),None,100)).toDF("name","age","not_imp_column")
scala> inputDf.show(false)
+-----+----+--------------+
|name |age |not_imp_column|
+-----+----+--------------+
|Sam |null|200 |
|null |31 |30 |
|John |25 |25 |
|Harry|null|100 |
+-----+----+--------------+
Our validation columns are name and age, where we shall count nulls. We put them in a List:
scala> val validationColumns = List("name","age")
And we create a SQL string that will drive the whole calculation:
scala> val sqlStr =
     |   "select " + validationColumns.map(x => "sum(" + x + "_count) AS " + x + "_sum").mkString(",") +
     |   " from (select " + validationColumns.map(x => "case when " + x + " = '$$' then 1 else 0 end AS " + x + "_count").mkString(",") +
     |   " from (select" + validationColumns.map(x => " nvl( " + x + ",'$$') as " + x).mkString(",") +
     |   " from example_table where " + validationColumns.map(x => x + " is null ").mkString("or ") +
     |   " ) layer1 ) layer2 "
It resolves to the following (the innermost layer replaces nulls with the sentinel '$$', the middle layer turns each sentinel into a 1, and the outer layer sums those 1s per column):
"select sum(name_count) AS name_sum,sum(age_count) AS age_sum from (select case when name = '$$' then 1 else 0 end AS name_count,case when age = '$$' then 1 else 0 end AS age_count from (select nvl( name,'$$') as name, nvl( age,'$$') as age from example_table where name is null or age is null ) layer1 ) layer2 "
Now we create a temporary view of our DataFrame:
inputDf.createOrReplaceTempView("example_table")
The only thing left to do is execute the SQL and build the Map by zipping the column names with the single row of sums:
(validationColumns zip spark.sql(sqlStr).collect.flatMap(_.toSeq)).toMap
And the result:
Map(name -> 1, age -> 2)
// obviously you can make it type safe
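For instance, a minimal type-safe sketch (assuming the sums come back as bigint, which is what Spark SQL's sum over an integer column returns, so each value can be read as a Long by position):

// Read the single result row and extract each sum as a Long,
// giving a Map[String, Long] instead of a Map[String, Any].
val row = spark.sql(sqlStr).head
val nullCounts: Map[String, Long] =
  validationColumns.zipWithIndex.map { case (c, i) => c -> row.getLong(i) }.toMap
// nullCounts: Map(name -> 1, age -> 2)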
Upvotes: 1