Reputation: 4618
I'm trying to switch from pandas to PySpark. Until now, whenever I did an analysis, I used pd.read_sql_query to read the data I needed from a Redshift database.
Example:
query = '''
SELECT id, SUM(value)
FROM table
GROUP BY id
'''
df = pd.read_sql_query(query, engine)
Is there an equivalent function in PySpark? Something that receives a query and a SQLAlchemy engine and returns the result of the query? If not, what is the best way to get the result of a SQL query in PySpark?
I looked through pyspark.SQLContext but didn't find anything useful.
Upvotes: 1
Views: 3177
Reputation: 31490
Use the spark.sql() API to run your query.
Example:
query = 'select 1'
spark.sql(query).show()
#+---+
#|  1|
#+---+
#|  1|
#+---+
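To mirror the pandas query from the question, register a DataFrame as a temporary view so spark.sql() can reference it. A minimal sketch; the sample data and the view name my_table are made up for illustration:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# hypothetical sample data standing in for the Redshift table
df = spark.createDataFrame([(1, 10.0), (1, 5.0), (2, 7.5)], ["id", "value"])
df.createOrReplaceTempView("my_table")  # expose it to the SQL engine

spark.sql('''
SELECT id, SUM(value) AS total
FROM my_table
GROUP BY id
''').show()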
To run a query against an external RDBMS such as Redshift, use spark.read.format("jdbc") to establish the connection and execute it.
spark.read.format("jdbc").option(...).load()
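A minimal sketch for Redshift, assuming the Redshift JDBC driver is on Spark's classpath; the host, port, database, credentials, and table name below are placeholders for your own setup:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

query = '''
SELECT id, SUM(value) AS total
FROM my_table
GROUP BY id
'''

df = (spark.read.format("jdbc")
      .option("url", "jdbc:redshift://my-host:5439/my_db")  # placeholder endpoint
      .option("query", query)  # Spark 2.4+; pushes the query down to the database
      .option("user", "my_user")
      .option("password", "my_password")
      .option("driver", "com.amazon.redshift.jdbc42.Driver")
      .load())
df.show()
On Spark versions before 2.4 the query option is not available; wrap the query in parentheses and pass it as a subquery via the dbtable option instead, e.g. .option("dbtable", "(" + query + ") AS subq").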
Upvotes: 1