Rinku Buragohain

Reputation: 207

Does SparkSQL support subquery?

I am running this query in the Spark shell, but it gives me an error:

sqlContext.sql(
  "select sal from samplecsv where sal < (select MAX(sal) from samplecsv)"
).collect().foreach(println)

error:

java.lang.RuntimeException: [1.47] failure: ``)'' expected but identifier MAX found

select sal from samplecsv where sal < (select MAX(sal) from samplecsv)
                                              ^
        at scala.sys.package$.error(package.scala:27)

Can anybody explain this to me? Thanks.

Upvotes: 17

Views: 56375

Answers (2)

zero323

Reputation: 330303

Planned features:

  • SPARK-23945 (Column.isin() should accept a single-column DataFrame as input).
  • SPARK-18455 (General support for correlated subquery processing).

Spark 2.0+

Spark SQL should support both correlated and uncorrelated subqueries. See SubquerySuite for details. Some examples include:

select * from l where exists (select * from r where l.a = r.c)
select * from l where not exists (select * from r where l.a = r.c)

select * from l where l.a in (select c from r)
select * from l where a not in (select c from r)
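
In Spark 2.0+, the query from the question should therefore work as written. A minimal sketch for the Scala shell, assuming a SparkSession named spark and the samplecsv view from the question:

// Spark 2.0+: uncorrelated scalar subqueries in WHERE are supported.
// Assumes a SparkSession `spark` and a temporary view `samplecsv`
// with a numeric column `sal`, as in the question.
spark.sql(
  """SELECT sal
    |FROM samplecsv
    |WHERE sal < (SELECT MAX(sal) FROM samplecsv)""".stripMargin
).show()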

Unfortunately, as of now (Spark 2.0), it is impossible to express the same logic using the DataFrame DSL.
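
Join operators can often stand in, though. A sketch of the IN / NOT IN examples above in the DSL, using hypothetical DataFrames l and r that mirror the SQL:

// Sketch: DSL equivalents built from join types available in Spark 2.0.
// Hypothetical DataFrames matching the SQL examples above; assumes a
// SparkSession `spark` (implicits come for free in spark-shell).
import spark.implicits._
val l = Seq(1, 2, 3).toDF("a")
val r = Seq(2, 3, 4).toDF("c")

val inRows    = l.join(r, l("a") === r("c"), "leftsemi")  // ~ a IN (SELECT c FROM r)
val notInRows = l.join(r, l("a") === r("c"), "leftanti")  // ~ NOT EXISTS; NOT IN treats NULLs differently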

Spark < 2.0

Spark supports subqueries in the FROM clause (same as Hive <= 0.12).

SELECT col FROM (SELECT * FROM t1 WHERE bar) t2
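
A quick sketch of running this from the shell, assuming a hypothetical registered table t1 with columns col and a boolean-like bar:

// Hedged sketch, Spark < 2.0: derived tables in FROM work fine.
// Assumes the usual `sqlContext` from the shell and a hypothetical
// table `t1` registered beforehand.
sqlContext.sql(
  "SELECT col FROM (SELECT * FROM t1 WHERE bar) t2"
).collect().foreach(println)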

It simply doesn't support subqueries in the WHERE clause. Generally speaking, arbitrary subqueries (in particular correlated subqueries) cannot be expressed using Spark without promoting them to a Cartesian join.

Since subquery performance is usually a significant issue in a typical relational system, and every subquery can be expressed using a JOIN, there is no loss of functionality here.
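
For example, the scalar subquery from the question can be pushed into the FROM clause and joined against. A sketch, assuming a HiveContext (the plain SQL parser may reject CROSS JOIN) and the samplecsv table from the question:

// Sketch: the question's query rewritten without a WHERE-clause subquery.
// The one-row aggregate is cross-joined against every row, so the
// comparison becomes a plain WHERE predicate.
sqlContext.sql(
  """SELECT s.sal
    |FROM samplecsv s
    |CROSS JOIN (SELECT MAX(sal) AS max_sal FROM samplecsv) m
    |WHERE s.sal < m.max_sal""".stripMargin
).collect().foreach(println)

Alternatively, collect the scalar first (sqlContext.sql("SELECT MAX(sal) FROM samplecsv").first()) and interpolate it into the outer query.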

Upvotes: 42

Tagar

Reputation: 14929

https://issues.apache.org/jira/browse/SPARK-4226

There is a pull request to implement that feature... my guess is it might land in Spark 2.0.

Upvotes: 0
