Ayoub Omari
Ayoub Omari

Reputation: 906

How to get columns of a spark sql query without execution

I want to extract columns of a spark sql query without executing it. With parsePlan:

val logicalPlan = spark.sessionState.sqlParser.parsePlan(query)
logicalPlan.collect{
    case p: Project => p.projectList.map(_.name)
}.flatten

I was able to extract the list of columns. However it doesn't work in case of Select *, and throws an exception with the following message : An exception or error caused a run to abort: Invalid call to name on unresolved object, tree: *.

Upvotes: 1

Views: 1297

Answers (1)

werner
werner

Reputation: 14905

Without any form of execution it is not possible for Spark to determine the columns. For example if a table was loaded from a csv file

spark.read.option("header",true).csv("data.csv").createOrReplaceTempView("csvTable")

then the query

select * from csvTable

would not be able to read the column names without reading (at least the first line) of the csv file.

Extracting a bit of code from Spark's explain command the following lines get as close as possible to an answer to the question:

val logicalPlan: LogicalPlan = spark.sessionState.sqlParser.parsePlan("select * from csvTable")
val queryExecution: QueryExecution = spark.sessionState.executePlan(logicalPlan)
val outputAttr: Seq[Attribute] = queryExecution.sparkPlan.output
val colNames: Seq[String] = outputAttr.map(a => a.name)
println(colNames)

If the file data.csv contains the columns a and b the code prints

List(a, b)


Disclaimer: QueryExecution is not considered to be a public class that might be used by application developers. As of now (Spark version 2.4.5) the code above works, but it is not guaranteed to work in future versions.

Upvotes: 1

Related Questions