BigData Lover

Reputation: 141

How to Load data from Azure SQL DB to Synapse Spark Datalake?

I am looking for a solution to load data from Azure SQL DB into an Azure Synapse Spark data lake (not into a dedicated SQL pool).

The requirements are:

  1. We have a CSV file containing the data. Currently we update or insert data directly in that CSV file, which we read from the Spark data lake and load into dataframes.
  2. Rather than using a CSV file, we want to load the CSV data into Azure SQL DB, so that any future updates or inserts are done directly in Azure SQL DB only.
  3. Currently we do our transformations in Synapse using PySpark and read the file data through the Spark tables in our lake database. The CSV files are stored in our Synapse ADLS and read from there.
  4. We want to make a connection from Azure SQL DB to the Synapse Spark data lake, so that if any upsert happens in SQL DB, those changes are also reflected in our table in the Spark data lake, and when we load those tables as dataframes in our Synapse notebook they always contain the latest data.

Thanks in advance for your responses.

Upvotes: 0

Views: 1176

Answers (1)

Rakesh Govindula

Reputation: 11329

You can do it in the following ways.

By connecting the Azure SQL database to the Synapse notebook via a JDBC connection:

First, go to the SQL database and copy the JDBC credentials from the connection strings.


In this approach, your table should have a last_modified date column, which is what lets you pick up only new or changed data.

Now, in the Synapse notebook, use the following code.

# JDBC connection details for the Azure SQL database
jdbcHostname = "rakeshserver.database.windows.net"
jdbcDatabase = "rakeshsqldatbase"
jdbcPort = "1433"
username = "rakeshadmin"
password = "< your password >"
jdbcUrl = "jdbc:sqlserver://{0}:{1};database={2}".format(jdbcHostname, jdbcPort, jdbcDatabase)
connectionProperties = {
    "user": username,
    "password": password,
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"
}
# Pushdown query: only rows modified within the given date range are read
pushdown_query = "(select * from ok where last_modified >= '2022-10-01' and last_modified <= '2022-11-01') ok2"
df = spark.read.jdbc(url=jdbcUrl, table=pushdown_query, properties=connectionProperties)
display(df)
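
If you also want the latest rows to land in a table in your lake database (so that every notebook reading that table picks up the fresh data), you could overwrite a Spark table after the read. A minimal sketch, assuming a lake database named my_lake_db and a table named my_table (both hypothetical names):

# Persist the freshly read rows as a table in the lake database so that
# downstream notebooks reading this table always see the latest data.
# "my_lake_db" and "my_table" are placeholder names.
df.write.mode("overwrite").saveAsTable("my_lake_db.my_table")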

In the Synapse notebook, get the data from the table as shown above. Schedule this notebook to run every day using a Notebook activity in a pipeline. To make the above dates dynamic, you can pass them from the Notebook activity as parameters.
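
For example, a minimal sketch of the notebook side of that, assuming base parameters named start_date and end_date are defined on the Notebook activity (the parameter names are assumptions, not anything required by Synapse):

# Parameters cell (toggle "Parameters" on this cell in the Synapse notebook);
# the Notebook activity overrides these default values at run time.
start_date = "2022-10-01"
end_date = "2022-11-01"
# Build the pushdown query from the parameters
pushdown_query = ("(select * from ok where last_modified >= '{0}' "
                  "and last_modified <= '{1}') ok2").format(start_date, end_date)
df = spark.read.jdbc(url=jdbcUrl, table=pushdown_query, properties=connectionProperties)
display(df)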

My execution of the dataframe:

By using a Lookup activity on SQL and passing its JSON output to the Notebook activity as a parameter:

First, use a Lookup query (filtering on the last_modified date) to get the desired results as JSON, and then pass that output array as a parameter to the Synapse notebook using a Notebook activity. You can read it as a dataframe in the Synapse notebook code, as sketched below.
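
A minimal sketch of the notebook side, assuming the Lookup output array arrives as a JSON string in a notebook parameter named lookup_output (a hypothetical name):

import json
# Parameters cell; the Notebook activity passes the Lookup output here.
# The value below is only an illustrative default.
lookup_output = '[{"id": 1, "name": "a", "last_modified": "2022-10-05"}]'
# Parse the JSON array and turn it into a Spark dataframe
rows = json.loads(lookup_output)
df = spark.createDataFrame(rows)
display(df)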

Upvotes: 0
