FAA

Reputation: 179

How to set up authorization for Delta Live Tables to access Azure Data Lake files?

I am writing Delta Live Tables notebooks in SQL to access files from the data lake, something like this:

CREATE OR REFRESH STREAMING LIVE TABLE MyTable
AS SELECT * FROM cloud_files("DataLakeSource/MyTableFiles", "parquet", map("cloudFiles.inferColumnTypes", "true"))

Whenever I need to access the Azure Data Lake, I usually do something like this to set up the access:

# Service principal secret stored in a Databricks secret scope
service_credential = dbutils.secrets.get(scope="myscope", key="mykey")

# OAuth (service principal) access to the storage account
spark.conf.set("fs.azure.account.auth.type.mylake.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.mylake.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.mylake.dfs.core.windows.net", "99999999-9999-9999-9999-999999999")
spark.conf.set("fs.azure.account.oauth2.client.secret.mylake.dfs.core.windows.net", service_credential)
spark.conf.set("fs.azure.account.oauth2.client.endpoint.mylake.dfs.core.windows.net", "https://login.microsoftonline.com/99999999-9999-9999-9999-9999999999/oauth2/token")

Since I can't add a Python cell like the one above to set up the access inside a SQL Delta Live Tables notebook, how/where do I add the configuration for access to the data lake files?

I've thought about adding the config info to the pipeline under Configuration, but that of course won't work with the call to dbutils.secrets.get.

Upvotes: 2

Views: 1956

Answers (3)

Anupam Chand

Reputation: 2687

After a lot of searching, I finally found the documentation that explains how to do this. I'm adding the answer here to benefit anyone else with the same question. You need to update the pipeline definition JSON as shown in this link. In my example, I have used an account key, but you can use the same method for any other secret. Your cluster definition will look like this:

"clusters": [
        {
            "label": "default",
            "node_type_id": "Standard_DS3_v2",
            "driver_node_type_id": "Standard_DS3_v2",
            "num_workers": 0
        },
        {
            "label": "updates",
            "spark_conf": {
                "spark.hadoop.fs.azure.account.key.<storage_acct>.dfs.core.windows.net": "{{secrets/<scope_name>/<secret_name>}}"
            }
        }
    ],

You will need to repeat this spark_conf for your maintenance cluster definition (the cluster entry with "label": "maintenance").

Upvotes: 0

Jacek Laskowski

Reputation: 74749

When creating your Delta Live Tables pipeline, use two notebooks:

  1. The SQL notebook with CREATE OR REFRESH STREAMING LIVE TABLE MyTable definition
  2. The Python notebook with the service_credential and fs.azure.account properties

The DLT runtime should be able to resolve the order of the notebooks and fire up authorization.


Alex Ott's comment seems correct:

You need to provide this configuration as part of the pipeline definition.

There'd be no dependency between the two notebooks (one with SQL and the other with spark.conf.sets, or even SETs), so the DLT runtime couldn't choose one over the other as the first to execute and hence set the properties.


What's even more interesting (and something I didn't really know about while answering this question) is the following (found in Configure pipeline settings for Delta Live Tables):

You can configure most settings with either the UI or a JSON specification. Some advanced options are only available using the JSON configuration.

And then in Configure your compute settings:

Compute settings in the Delta Live Tables UI primarily target the default cluster used for pipeline updates. If you choose to specify a storage location that requires credentials for data access, you must ensure that the maintenance cluster also has these permissions configured.

Delta Live Tables provides similar options for cluster settings as other compute on Databricks. Like other pipeline settings, you can modify the JSON configuration for clusters to specify options not present in the UI.

In other words, you have to use the Delta Live Tables API or something similar (e.g. the Databricks Terraform provider) that gives you access to the cluster-related settings.
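
As an illustration only, here is a minimal Python sketch of editing those cluster settings over the Pipelines REST API. It assumes the GET/PUT /api/2.0/pipelines/<pipeline-id> endpoints and that the editable definition is returned under spec; everything in angle brackets (workspace URL, token, pipeline id, storage account, secret scope and key) is a placeholder, not something from the question:

# A sketch only: add the storage credential to every cluster in the pipeline
# definition (including the maintenance cluster) via the Pipelines REST API.
import requests

host = "https://<databricks-instance>"   # placeholder workspace URL
token = "<personal-access-token>"        # placeholder token
pipeline_id = "<pipeline-id>"
headers = {"Authorization": f"Bearer {token}"}

# Secret reference resolved by the cluster at runtime, not by this script.
secret_conf = {
    "spark.hadoop.fs.azure.account.key.<storage_acct>.dfs.core.windows.net":
        "{{secrets/<scope_name>/<secret_name>}}"
}

# Read the current pipeline specification.
resp = requests.get(f"{host}/api/2.0/pipelines/{pipeline_id}", headers=headers)
resp.raise_for_status()
spec = resp.json()["spec"]

# Merge the credential into every cluster entry; add a maintenance entry if none exists.
clusters = spec.get("clusters", [])
if not any(c.get("label") == "maintenance" for c in clusters):
    clusters.append({"label": "maintenance"})
for cluster in clusters:
    cluster.setdefault("spark_conf", {}).update(secret_conf)
spec["clusters"] = clusters

# Write the updated specification back to the pipeline.
requests.put(f"{host}/api/2.0/pipelines/{pipeline_id}",
             headers=headers, json=spec).raise_for_status()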

Configure S3 access with instance profiles

Another option seems to be Configure S3 access with instance profiles, which requires that you "have sufficient privileges in the AWS account containing your Databricks workspace, and be a Databricks workspace administrator."


Upvotes: 0

FAA

Reputation: 179

You can create a separate notebook with the connection information in it and call it first, then call the SQL Delta Live Tables notebook.
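
A minimal sketch of such a connection notebook, simply reusing the OAuth settings from the question (the storage account mylake and the secret scope/key are the question's placeholders; <application-id> and <tenant-id> stand in for the GUIDs):

# Connection notebook: added to the same DLT pipeline so the OAuth access to the
# data lake is configured before the SQL notebook reads any files.
service_credential = dbutils.secrets.get(scope="myscope", key="mykey")

spark.conf.set("fs.azure.account.auth.type.mylake.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.mylake.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.mylake.dfs.core.windows.net",
               "<application-id>")
spark.conf.set("fs.azure.account.oauth2.client.secret.mylake.dfs.core.windows.net",
               service_credential)
spark.conf.set("fs.azure.account.oauth2.client.endpoint.mylake.dfs.core.windows.net",
               "https://login.microsoftonline.com/<tenant-id>/oauth2/token")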

Upvotes: 1
