Reputation: 507
Is there an easy way to run SQL on multiple columns in Spark SQL?
For example, let's say I have a query that should be applied to most columns:
select
min(c1) as min,
max(c1) as max,
max(c1) - min(c1) as range
from tb1
If there are multiple columns, is there a way to execute the query for all of them and get the results in one pass, similar to what df.describe does?
Upvotes: 1
Views: 824
Reputation: 5792
Use the metadata (the columns, in this case) included in your DataFrame (which you can get via spark.table("<table_name>") if you don't have it in scope already) to get the column names, then build the expressions you want and pass them to df.select (or df.selectExpr).
Build some test data:
scala> val r = new scala.util.Random
scala> var seq = Seq[(Int, Int, Float)]()
seq: Seq[(Int, Int, Float)] = List()
scala> (1 to 1000).foreach(n => { seq = seq :+ (n, r.nextInt, r.nextFloat) })
scala> val df = seq.toDF("id", "some_int", "some_float")
Denote some functions we want to run on all the columns:
scala> val functions_to_apply = Seq("min", "max")
functions_to_apply: Seq[String] = List(min, max)
Set up the final Seq of SQL Columns:
scala> var select_columns = Seq[org.apache.spark.sql.Column]()
select_columns: Seq[org.apache.spark.sql.Column] = List()
Iterate over the columns and the functions to apply, populating the select_columns Seq (expr comes from org.apache.spark.sql.functions):
scala> import org.apache.spark.sql.functions.expr
scala> val cols = df.columns
scala> cols.foreach(col => { functions_to_apply.foreach(f => { select_columns = select_columns :+ expr(s"$f($col)") }) })
Run the actual query:
scala> df.select(select_columns:_*).show
+-------+-------+-------------+-------------+---------------+---------------+
|min(id)|max(id)|min(some_int)|max(some_int)|min(some_float)|max(some_float)|
+-------+-------+-------------+-------------+---------------+---------------+
| 1| 1000| -2143898568| 2147289642| 1.8781424E-4| 0.99964607|
+-------+-------+-------------+-------------+---------------+---------------+
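As a side note, the two mutable loops above can be collapsed into a single immutable flatMap, and you can add one more expression per column to cover the range (max - min) asked about in the question. This is a sketch in the same spirit as the code above; the alias names (e.g. min_id, range_id) are my own choice, not anything Spark requires:

```scala
import org.apache.spark.sql.functions.expr

// One expression per (column, function) pair, plus a range expression
// per column, each aliased so the output column names stay readable.
val select_columns = df.columns.flatMap { c =>
  functions_to_apply.map(f => expr(s"$f($c) as ${f}_$c")) :+
    expr(s"max($c) - min($c) as range_$c")
}

df.select(select_columns: _*).show()
```

This produces the same min/max columns as the loop version, with range_id, range_some_int, and range_some_float added alongside them.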
Upvotes: 1