formicaman

Reputation: 1357

How to run .sql file in PySpark

I am able to run a single SQL query using PySpark. For example:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("___").getOrCreate()
df = spark.sql("select * from ...")

But I have a .sql file full of many queries that I want to run. Do I have to execute them one at a time, or is there an easy way to run all of them?

Example: table1 and table2 each have 3 rows and 3 columns.

.sql file

select * from table1;

select * from table2;

Output df:

   A  B  C
0  1  2  3
1  4  5  6
2  7  8  9
3  a  b  c
4  d  e  f
5  g  h  i

or

df1

   A  B  C
0  1  2  3
1  4  5  6
2  7  8  9

and df2

   A  B  C
0  a  b  c
1  d  e  f
2  g  h  i

Upvotes: 0

Views: 3469

Answers (1)

mck

Reputation: 42422

If you want to collect each query's result into a list of dataframes (assuming each line of the file is a single query):

with open('/path/to/file.sql', 'r') as f:
    lines = f.readlines()

dfs = []
for line in lines:
    query = line.strip().rstrip(';')  # drop the newline and any trailing semicolon
    if query:  # skip blank lines
        dfs.append(spark.sql(query))
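If the statements in the file span multiple lines, a variation is to split the file contents on the statement separator instead of iterating line by line. This is a minimal sketch, assuming statements are separated by semicolons and no semicolon appears inside a string literal:

with open('/path/to/file.sql', 'r') as f:
    statements = f.read().split(';')

# run each non-empty statement and collect the resulting dataframes
dfs = [spark.sql(s.strip()) for s in statements if s.strip()]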

If you want to combine all the dataframes into one (assuming they all have the same schema):

from functools import reduce

# stack the dataframes row-wise; union matches columns by position
df = reduce(lambda x, y: x.union(y), dfs)
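For completeness, a quick end-to-end check against the asker's example. This is a sketch, not part of the original answer: the two temp views are hypothetical stand-ins for table1 and table2, with all values kept as strings so the schemas match:

from functools import reduce

data1 = [("1", "2", "3"), ("4", "5", "6"), ("7", "8", "9")]
data2 = [("a", "b", "c"), ("d", "e", "f"), ("g", "h", "i")]
spark.createDataFrame(data1, ["A", "B", "C"]).createOrReplaceTempView("table1")
spark.createDataFrame(data2, ["A", "B", "C"]).createOrReplaceTempView("table2")

dfs = [spark.sql("select * from table1"), spark.sql("select * from table2")]
df = reduce(lambda x, y: x.union(y), dfs)
df.show()  # six rows: 1..9 followed by a..i, matching the first expected output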

Upvotes: 1
