Rinku Buragohain

Reputation: 207

Does SparkSQL support subquery?

I am running this query in the Spark shell, but it gives me an error:

sqlContext.sql(
  "select sal from samplecsv where sal < (select MAX(sal) from samplecsv)"
).collect().foreach(println)

error:

java.lang.RuntimeException: [1.47] failure: ``)'' expected but identifier MAX found

select sal from samplecsv where sal < (select MAX(sal) from samplecsv)
                                              ^
        at scala.sys.package$.error(package.scala:27)

Can anybody explain this to me? Thanks.

Upvotes: 17

Views: 56375

Answers (2)

zero323

Reputation: 330303

Planned features:

  • SPARK-23945 (Column.isin() should accept a single-column DataFrame as input).
  • SPARK-18455 (General support for correlated subquery processing).

Spark 2.0+

Spark SQL should support both correlated and uncorrelated subqueries. See SubquerySuite for details. Some examples include:

select * from l where exists (select * from r where l.a = r.c)
select * from l where not exists (select * from r where l.a = r.c)

select * from l where l.a in (select c from r)
select * from l where a not in (select c from r)
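
In Spark 2.0+, the query from the question should therefore work as written. A minimal sketch for the Scala shell, assuming a SparkSession named spark and the samplecsv view from the question:

// Spark 2.0+: uncorrelated scalar subqueries in WHERE are supported.
// Assumes a SparkSession `spark` and a temporary view `samplecsv`
// with a numeric column `sal`, as in the question.
spark.sql(
  """SELECT sal
    |FROM samplecsv
    |WHERE sal < (SELECT MAX(sal) FROM samplecsv)""".stripMargin
).show()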

Unfortunately, as of now (Spark 2.0), it is impossible to express the same logic using the DataFrame DSL.
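
Join operators can often stand in, though. A sketch of the IN / NOT IN examples above in the DSL, using hypothetical DataFrames l and r that mirror the SQL:

// Sketch: DSL equivalents built from join types available in Spark 2.0.
// Hypothetical DataFrames matching the SQL examples above; assumes a
// SparkSession `spark` (implicits come for free in spark-shell).
import spark.implicits._
val l = Seq(1, 2, 3).toDF("a")
val r = Seq(2, 3, 4).toDF("c")

val inRows    = l.join(r, l("a") === r("c"), "leftsemi")  // ~ a IN (SELECT c FROM r)
val notInRows = l.join(r, l("a") === r("c"), "leftanti")  // ~ NOT EXISTS; NOT IN treats NULLs differently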

Spark < 2.0

Spark supports subqueries in the FROM clause (same as Hive <= 0.12).

SELECT col FROM (SELECT * FROM t1 WHERE bar) t2
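
A quick sketch of running this from the shell, assuming a hypothetical registered table t1 with columns col and a boolean-like bar:

// Hedged sketch, Spark < 2.0: derived tables in FROM work fine.
// Assumes the usual `sqlContext` from the shell and a hypothetical
// table `t1` registered beforehand.
sqlContext.sql(
  "SELECT col FROM (SELECT * FROM t1 WHERE bar) t2"
).collect().foreach(println)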

It simply doesn't support subqueries in the WHERE clause. Generally speaking, arbitrary subqueries (in particular correlated subqueries) cannot be expressed using Spark without promoting them to a Cartesian join.

Since subquery performance is usually a significant issue in a typical relational system, and every subquery can be expressed using a JOIN, there is no loss of functionality here.
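
For example, the scalar subquery from the question can be pushed into the FROM clause and joined against. A sketch, assuming a HiveContext (the plain SQL parser may reject CROSS JOIN) and the samplecsv table from the question:

// Sketch: the question's query rewritten without a WHERE-clause subquery.
// The one-row aggregate is cross-joined against every row, so the
// comparison becomes a plain WHERE predicate.
sqlContext.sql(
  """SELECT s.sal
    |FROM samplecsv s
    |CROSS JOIN (SELECT MAX(sal) AS max_sal FROM samplecsv) m
    |WHERE s.sal < m.max_sal""".stripMargin
).collect().foreach(println)

Alternatively, collect the scalar first (sqlContext.sql("SELECT MAX(sal) FROM samplecsv").first()) and interpolate it into the outer query.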

Upvotes: 42

Tagar

Reputation: 14929

https://issues.apache.org/jira/browse/SPARK-4226

There is a pull request to implement that feature... my guess is it might land in Spark 2.0.

Upvotes: 0
