I am trying to read a SQL query from a file and run it inside a PySpark job. The SQL is structured as below:
select <statements>
sort by rand()
limit 333333
UNION ALL
select <statements>
sort by rand()
limit 666666
Here is the error I get when I run it:
pyspark.sql.utils.ParseException: u"\nmismatched input 'UNION' expecting {, '.', '[', 'OR', 'AND', 'IN', NOT, 'BETWEEN', 'LIKE', RLIKE, 'IS', EQ, '<=>', '<>', '!=', '<', LTE, '>', GTE, '+', '-', '*', '/', '%', 'DIV', '&', '|', '^'}
Is this because UNION ALL/UNION is not supported by Spark SQL, or has the parsing gone wrong?
PySpark and Hive both support UNION in SQL statements. The parse error comes from the per-branch sort by/limit clauses: a SELECT that has its own ORDER BY/LIMIT must be wrapped in parentheses so those clauses don't swallow the UNION. I am able to run the following Hive statement:
(SELECT * from x ORDER BY rand() LIMIT 50)
UNION
(SELECT * from y ORDER BY rand() LIMIT 50)
In PySpark you can also do this:
df1 = spark.sql('SELECT * from x ORDER BY rand() LIMIT 50')
df2 = spark.sql('SELECT * from y ORDER BY rand() LIMIT 50')
df = df1.union(df2)
Note that DataFrame union() keeps duplicates (it behaves like SQL UNION ALL); chain .distinct() afterwards if you want UNION semantics.
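The parenthesization rule can be checked without a Spark cluster. This sketch uses SQLite (an assumption for portability, not the asker's engine — Spark accepts plain parentheses around each branch, while SQLite wants derived tables), but the parsing issue is the same: an unscoped LIMIT before UNION ALL is rejected.

```python
# Sketch: LIMIT must be scoped to its branch before UNION ALL.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE x(v INTEGER);
    CREATE TABLE y(v INTEGER);
    INSERT INTO x VALUES (1), (2), (3);
    INSERT INTO y VALUES (4), (5), (6);
""")

# An unscoped LIMIT before UNION ALL is a parse error, as in Spark SQL.
try:
    conn.execute("SELECT v FROM x LIMIT 2 UNION ALL SELECT v FROM y LIMIT 2")
except sqlite3.OperationalError as e:
    print("parse error:", e)

# Scoping each branch (here as derived tables) makes the query valid.
rows = conn.execute("""
    SELECT * FROM (SELECT v FROM x ORDER BY RANDOM() LIMIT 2)
    UNION ALL
    SELECT * FROM (SELECT v FROM y ORDER BY RANDOM() LIMIT 2)
""").fetchall()
print(len(rows))  # 2 rows sampled from x plus 2 from y
```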