sidhart

Reputation: 3

Implementing Hive UNION in Pyspark

I am trying to read a SQL query from a file and run it inside a PySpark job. The SQL is structured as below:

select <statements>
sort by rand()
limit 333333 
UNION ALL
select <statements>
sort by rand()
limit 666666

Here is the error I get when I run it:

pyspark.sql.utils.ParseException: u"\nmismatched input 'UNION' expecting {, '.', '[', 'OR', 'AND', 'IN', NOT, 'BETWEEN', 'LIKE', RLIKE, 'IS', EQ, '<=>', '<>', '!=', '<', LTE, '>', GTE, '+', '-', '*', '/', '%', 'DIV', '&', '|', '^'}

Is this because UNION ALL/UNION is not supported by Spark SQL, or has the parsing gone wrong somewhere?

Upvotes: 0

Views: 1324

Answers (1)

pratiklodha

Reputation: 1125

Both PySpark and Hive support UNION in SQL statements. The parser fails on your query because each SELECT that carries its own SORT BY/LIMIT must be wrapped in parentheses; otherwise the LIMIT clause ends the statement and the following UNION is unexpected. I am able to run the following Hive statement:

(SELECT * from x ORDER BY rand() LIMIT 50)
UNION
(SELECT * from y ORDER BY rand() LIMIT 50)

In PySpark you can also do the same with the DataFrame API:

df1 = spark.sql('SELECT * from x ORDER BY rand() LIMIT 50')
df2 = spark.sql('SELECT * from y ORDER BY rand() LIMIT 50')
# Note: DataFrame.union() keeps duplicates (like UNION ALL);
# chain .distinct() to match SQL UNION semantics.
df = df1.union(df2)

Upvotes: 1
