Reputation: 10966
One can convert a raw SQL string into a DataFrame. But is the other direction also possible, i.e., can one get the SQL representation of the query logic of a (derived) Spark DataFrame?
// Imports needed outside spark-shell; `spark` is the active SparkSession
import org.apache.spark.sql.functions.count
import spark.implicits._

// Source data
val a = Seq(7, 8, 9, 7, 8, 7).toDF("foo")

// Query using DataFrame functions
val b = a.groupBy($"foo").agg(count("*") as "occurrences").orderBy($"occurrences")
b.show()

// Convert a SQL string into a DataFrame
val sqlString = "SELECT foo, count(*) AS occurrences FROM a GROUP BY foo ORDER BY occurrences"
a.createOrReplaceTempView("a")
val c = spark.sql(sqlString)
c.show()

// "Convert" a DataFrame into a SQL string
b.toSQLString() // Error: this method does not exist
Upvotes: 2
Views: 2267
Reputation: 18525
It is not possible to "convert" a DataFrame into a SQL string, because Spark has no facility for writing SQL queries, and it does not need one.
It helps to recall how DataFrame code or a SQL query is handled by Spark. Both go through Spark's Catalyst optimizer, which processes them in four transformational phases: Analysis, Logical Optimization, Physical Planning, and Code Generation.
In the first phase (Analysis), the Spark SQL engine generates an abstract syntax tree (AST) for the SQL or DataFrame query. This tree is the main data type in Catalyst (see section 4.1 of the white paper Spark SQL: Relational Data Processing in Spark), and it is used to create the logical plan and eventually the physical plan. You get a representation of those plans if you use the explain API that Spark offers.
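For example, calling explain with the extended flag on the DataFrame from the question prints the parsed, analyzed, and optimized logical plans as well as the physical plan (a minimal sketch, assuming b was built as in the question):

// Print the parsed, analyzed, and optimized logical plans plus the physical plan
b.explain(true)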
Although it is clear what you mean by "one can convert a raw SQL string into a DataFrame", it helps to be more precise. We are not really converting a SQL string (hence the quotation marks in your own code comment) into a DataFrame; rather, you applied your SQL knowledge, because this is a syntax Spark can parse to understand your intentions. Moreover, you cannot type in just any SQL query: it can still fail in the Analysis phase, when its identifiers are checked against the catalog. So the SQL string is just an agreed-upon way of giving Spark instructions. The query gets parsed, transformed into an AST (as described above), and, after going through the other three phases, ends up as RDD-based code. Executing SQL through the sql API returns a DataFrame, which you can easily turn into an RDD with df.rdd.
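A minimal sketch of that last point, assuming the temp view a is registered as in the question:

// spark.sql returns a DataFrame backed by RDD-based execution...
val c = spark.sql("SELECT foo, count(*) AS occurrences FROM a GROUP BY foo")
// ...which can be exposed as an RDD of Rows
val asRdd = c.rdd
println(asRdd.toDebugString) // prints the underlying RDD lineage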
Overall, there is no need for Spark to render any code, and DataFrame code in particular, as SQL syntax that you could then extract. The AST is the internal abstraction, and Spark converts DataFrame code directly into an AST; it never has to go through an SQL query first.
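If you want to inspect that internal abstraction instead of SQL, the queryExecution field on a Dataset exposes the plans Catalyst derived from the DataFrame code (a sketch; this is a developer-facing API):

// The analyzed logical plan: the tree Catalyst built from the DataFrame code
println(b.queryExecution.analyzed)
// The optimized logical plan and the physical plan that actually runs
println(b.queryExecution.optimizedPlan)
println(b.queryExecution.executedPlan)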
Upvotes: 5
Reputation: 737
No, there is no method that returns the SQL query behind a DataFrame. You will have to write the query yourself, by looking at all the filters and selects you used to create the DataFrame.
Upvotes: 1