Yaga

Reputation: 125

Databricks Spark conditional pull from Azure SQL

I'm trying to pull data from Azure SQL into Azure Databricks using the sql-spark-connector. I've installed the Maven library on the cluster and have successfully run this script on Databricks to pull from a sample database:

server_name = "jdbc:sqlserver://(removed).database.windows.net"
database_name = (removed)
url = server_name + ";" + "databaseName=" + database_name + ";"
table_name = "dbo.Test"
username = (removed)
password = (removed)


jdbcDF = spark.read \
        .format("com.microsoft.sqlserver.jdbc.spark") \
        .option("url", url) \
        .option("dbtable", table_name) \
        .option("user", username) \
        .option("password", password) \
        .load()

display(jdbcDF)

However, I eventually want to use this connector to pull from a large table (5 million+ rows), and I'd like to add a filter to the read in order to reduce the number of rows I pull, saving processing cost and keeping the dataframe small. Ideally I'd also pull only specific columns, but rows are the priority.

Looking at the available PySpark documentation, there doesn't seem to be a generic .filter option on the reader, and the sql-spark-connector documentation doesn't list any .option key that would filter the request either. I'm assuming there is no good way to reduce the data pulled, but please correct me if there is.

I'll probably drop the unnecessary columns and rows from the dataframe immediately after querying to help with cost and performance, as sketched below. Let me know if there are other ideas to consider!
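Something like this is what I have in mind; the column names (Id, Status, CreatedDate) are just placeholders:

# Trim the dataframe right after the load; column names are placeholders.
# With the DataSource API, Spark can often push the column pruning and
# simple filter predicates down to the source anyway.
trimmedDF = jdbcDF \
    .select("Id", "Status", "CreatedDate") \
    .filter(jdbcDF.Status == "Active")

display(trimmedDF)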

Upvotes: 0

Views: 704

Answers (1)

Ged

Reputation: 18003

Pass a query that contains the filter; the query is then executed on the database side. E.g.

...
...
# The subquery (with its alias) runs on the database, so only the
# matching rows are returned to Spark.
pushdown_query = "(select * from employees where emp_no < 10008) emp_alias"
df = spark.read.jdbc(url=jdbcUrl, table=pushdown_query,
                     properties=connectionProperties)
display(df)
...

The examples at https://docs.databricks.com/data/data-sources/sql-databases.html should help.
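If you want to keep the com.microsoft.sqlserver.jdbc.spark format from your question, the same idea should carry over, since that connector builds on the Spark JDBC data source. A sketch, reusing the url, username and password variables from your question (table and column names are illustrative):

# A subquery alias passed as "dbtable" is executed on the SQL Server side,
# so only the filtered rows and listed columns come back to Spark.
pushdown_query = "(select Id, Status from dbo.Test where Status = 'Active') t"

filteredDF = spark.read \
    .format("com.microsoft.sqlserver.jdbc.spark") \
    .option("url", url) \
    .option("dbtable", pushdown_query) \
    .option("user", username) \
    .option("password", password) \
    .load()

display(filteredDF)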

Upvotes: 1
