Reputation: 1357
I am able to run a single SQL query using PySpark. For example:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("___").getOrCreate()
df = spark.sql("select * from ...")
But I have a .sql file containing many queries that I want to run. Do I have to execute them one at a time, or is there an easy way to run all of them?
Example: table1 and table2 each have 3 rows and 3 columns.
Contents of the .sql file:
select * from table1;
select * from table2;
Desired output, either a single combined df:
A B C
0 1 2 3
1 4 5 6
2 7 8 9
3 a b c
4 d e f
5 g h i
or
df1
A B C
0 1 2 3
1 4 5 6
2 7 8 9
and df2
A B C
0 a b c
1 d e f
2 g h i
Upvotes: 0
Views: 3469
Reputation: 42422
If you want to collect all the query results into a list of dataframes (assuming each line of the file is one single query):
dfs = []
with open('/path/to/file.sql', 'r') as f:
    for line in f.readlines():
        # drop surrounding whitespace and the trailing semicolon, which spark.sql may reject
        query = line.strip().rstrip(';')
        if query:  # skip blank lines
            dfs.append(spark.sql(query))
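Note that reading line by line breaks if a single query spans multiple lines. A minimal sketch, assuming the queries in the file are separated by semicolons (and that no semicolon appears inside a string literal), splits the whole file instead:
with open('/path/to/file.sql', 'r') as f:
    # split on ';' so multi-line queries stay intact
    queries = [q.strip() for q in f.read().split(';') if q.strip()]
dfs = [spark.sql(q) for q in queries]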
If you want to combine all the dataframes into one (assuming they all have the same schema):
from functools import reduce

# union concatenates the dataframes by column position
df = reduce(lambda x, y: x.union(y), dfs)
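With the example tables above, df.show() then prints all six rows, table1's followed by table2's. One caveat: union resolves columns by position, not by name. If the tables share column names but not column order, and you are on Spark 2.3 or later, unionByName is the safer choice:
df = reduce(lambda x, y: x.unionByName(y), dfs)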
Upvotes: 1