Reputation: 1357
I am able to run a single SQL query using PySpark. For example:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("___").getOrCreate()
df = spark.sql("select * from ...")
But I have a .sql file containing many queries that I want to run. Do I have to execute them one at a time, or is there an easy way to run all of them?
Example: table1 and table2 each have 3 rows and 3 columns.
Contents of the .sql file:
select * from table1;
select * from table2;
Desired output, either a single combined df:
A B C
0 1 2 3
1 4 5 6
2 7 8 9
3 a b c
4 d e f
5 g h i
or
df1
A B C
0 1 2 3
1 4 5 6
2 7 8 9
and df2
A B C
0 a b c
1 d e f
2 g h i
Upvotes: 0
Views: 3469
Reputation: 42422
If you want to collect all the query results into a list of dataframes (assuming each line of the file is one single query):
dfs = []
with open('/path/to/file.sql', 'r') as f:
    for line in f.readlines():
        # drop surrounding whitespace and the trailing semicolon, which spark.sql may reject
        query = line.strip().rstrip(';')
        if query:  # skip blank lines
            dfs.append(spark.sql(query))
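Note that reading line by line breaks if a single query spans multiple lines. A minimal sketch, assuming the queries in the file are separated by semicolons (and that no semicolon appears inside a string literal), splits the whole file instead:
with open('/path/to/file.sql', 'r') as f:
    # split on ';' so multi-line queries stay intact
    queries = [q.strip() for q in f.read().split(';') if q.strip()]
dfs = [spark.sql(q) for q in queries]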
If you want to combine all the dataframes into one (assuming they all have the same schema):
from functools import reduce

# union concatenates the dataframes by column position
df = reduce(lambda x, y: x.union(y), dfs)
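With the example tables above, df.show() then prints all six rows, table1's followed by table2's. One caveat: union resolves columns by position, not by name. If the tables share column names but not column order, and you are on Spark 2.3 or later, unionByName is the safer choice:
df = reduce(lambda x, y: x.unionByName(y), dfs)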
Upvotes: 1