Reputation: 155
I have a big Spark SQL statement that I'm trying to break into smaller chunks for better code readability. I do not want to join the pieces, just merge their results.
Current working SQL statement -
val dfs = x.map(field => spark.sql(s"""
select 'test' as Table_Name,
'$field' as Column_Name,
min($field) as Min_Value,
max($field) as Max_Value,
approx_count_distinct($field) as Unique_Value_Count,
(
SELECT 100 * approx_count_distinct($field)/count(1)
from tempdftable
) as perc
from tempdftable
"""))
I'm trying to pull the query below out of the SQL above -
(SELECT 100 * approx_count_distinct($field)/count(1) from tempdftable) as perc
with this logic -
val Perce = x.map(field => spark.sql(s"(SELECT 100 * approx_count_distinct($field)/count(1) from parquetDFTable)"))
and later merge this val Perce back into the very first big SQL statement with the statement below, but it is not working -
val dfs = x.map(field => spark.sql(s"""
select 'test' as Table_Name,
'$field' as Column_Name,
min($field) as Min_Value,
max($field) as Max_Value,
approx_count_distinct($field) as Unique_Value_Count,
'"+Perce+ "'
from tempdftable
"""))
How do we write this?
Upvotes: 0
Views: 1370
Reputation: 13154
Why not go all in and convert your entire expression to Spark code?
import spark.implicits._
import org.apache.spark.sql.functions._
val fraction = udf((approxCount: Double, totalCount: Double) => 100 * approxCount/totalCount)
val fields = Seq("colA", "colB", "colC")
val dfs = fields.map(field => {
tempdftable
.select(min(field) as "Min_Value", max(field) as "Max_Value", approx_count_distinct(field) as "Unique_Value_Count", count(field) as "Total_Count")
.withColumn("Table_Name", lit("test"))
.withColumn("Column_Name", lit(field))
.withColumn("Perc", fraction('Unique_Value_Count, 'Total_Count))
.select('Table_Name, 'Column_Name, 'Min_Value, 'Max_Value, 'Unique_Value_Count, 'Perc)
})
val df = dfs.reduce(_ union _)
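As a side note, the udf isn't strictly necessary: the same ratio can be computed with plain column arithmetic, which Catalyst can optimize. A sketch of that variant (it uses `count(lit(1))` to mirror the `count(1)` in your original SQL, which counts all rows rather than non-null values):

```scala
import org.apache.spark.sql.functions._

// Same per-field summary, but computing Perc with built-in column
// arithmetic instead of a UDF. count(lit(1)) counts all rows,
// matching count(1) in the original SQL statement.
val dfsNoUdf = fields.map { field =>
  tempdftable.select(
    lit("test") as "Table_Name",
    lit(field) as "Column_Name",
    min(field) as "Min_Value",
    max(field) as "Max_Value",
    approx_count_distinct(field) as "Unique_Value_Count",
    (approx_count_distinct(field) * 100.0 / count(lit(1))) as "Perc"
  )
}
val dfNoUdf = dfsNoUdf.reduce(_ union _)
```

On the test data above this produces the same table as the UDF version.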
On a test example like this:
val tempdftable = spark.sparkContext.parallelize(List((3.0, 7.0, 2.0), (1.0, 4.0, 10.0), (3.0, 7.0, 2.0), (5.0, 0.0, 2.0))).toDF("colA", "colB", "colC")
tempdftable.show
+----+----+----+
|colA|colB|colC|
+----+----+----+
| 3.0| 7.0| 2.0|
| 1.0| 4.0|10.0|
| 3.0| 7.0| 2.0|
| 5.0| 0.0| 2.0|
+----+----+----+
We get
df.show
+----------+-----------+---------+---------+------------------+----+
|Table_Name|Column_Name|Min_Value|Max_Value|Unique_Value_Count|Perc|
+----------+-----------+---------+---------+------------------+----+
| test| colA| 1.0| 5.0| 3|75.0|
| test| colB| 0.0| 7.0| 3|75.0|
| test| colC| 2.0| 10.0| 2|50.0|
+----------+-----------+---------+---------+------------------+----+
Upvotes: 2