Bruno Mello

Reputation: 4618

PySpark equivalent of pandas read_sql_query

I'm trying to switch from pandas to PySpark. Usually, when doing my analysis, I used pd.read_sql_query to read the data I needed from a Redshift database.

Example:

import pandas as pd

query = '''
SELECT id, SUM(value)
FROM table
GROUP BY id
'''

# engine is a SQLAlchemy engine connected to Redshift
df = pd.read_sql_query(query, engine)

Is there an equivalent function in PySpark? Something that receives a query and a SQLAlchemy engine and returns the result of the query? If not, what is the best way to get the result of a SQL query in PySpark?

I tried to find something in pyspark.SQLContext but didn't find anything useful.

Upvotes: 1

Views: 3177

Answers (1)

notNull

Reputation: 31490

Use the spark.sql() API to run your query.

Example:

query = 'select 1'
spark.sql(query).show()
#+---+
#|  1|
#+---+
#|  1|
#+---+
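
spark.sql() runs against tables that Spark already knows about, so once data is loaded into a DataFrame you can register it as a temporary view and query it by name. A minimal sketch (the DataFrame contents and the view name my_table are made up for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Stand-in DataFrame; in practice this would come from a read (e.g. the JDBC read below)
df = spark.createDataFrame([(1, 10.0), (1, 5.0), (2, 3.0)], ["id", "value"])

# Register the DataFrame as a temp view so spark.sql() can reference it by name
df.createOrReplaceTempView("my_table")

spark.sql("SELECT id, SUM(value) AS total FROM my_table GROUP BY id").show()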

To run the query against an RDBMS such as Redshift, use spark.read.format("jdbc") to establish the connection and execute the query, as in the sketch below.

spark.read.format("jdbc").option(...).load()
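
A fuller sketch of the JDBC read, assuming a Redshift endpoint; the URL, credentials, and table name are placeholders, and the Redshift JDBC driver JAR must be on Spark's classpath. On Spark 2.4+ the query option pushes the whole statement down to the database:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

query = """
SELECT id, SUM(value) AS total
FROM my_table
GROUP BY id
"""

df = (spark.read.format("jdbc")
      # placeholder connection details -- replace with your own
      .option("url", "jdbc:redshift://host:5439/dbname")
      .option("user", "username")
      .option("password", "password")
      .option("driver", "com.amazon.redshift.jdbc42.Driver")
      # the database executes the query; Spark receives only the result
      .option("query", query)
      .load())

df.show()

On older Spark versions, the equivalent is to wrap the statement in the dbtable option as a subquery, e.g. .option("dbtable", "(SELECT ...) AS subq").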

Upvotes: 1
