Reputation: 11
I would like a Synapse notebook to read ADLS blob data from outside the managed VNet, but I am getting 403 errors (for both managed identities and UPNs).
java.nio.file.AccessDeniedException: Operation failed: "This request is not authorized to perform this operation.", 403, HEAD ...
at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.checkException(AzureBlobFileSystem.java:1200)
at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.getFileStatus(AzureBlobFileSystem.java:519)
at org.apache.hadoop.fs.FileSystem.isDirectory(FileSystem.java:1713)
at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:47)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:377)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:332)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:315)
The ADLS storage account is configured with a firewall. A third-party vendor needs a vanilla ADLS storage account with no private endpoints to land HR data; their integration is an off-the-shelf product. We do not want to allow anonymous access.
Current configurations:
The ADLS logs show "Azure Synapse Analytics ... blocked" with Client IP Address: XXX.XXX.XXX.XXX.
The Synapse managed VNet is not accessible, so I cannot grab its IP address. I can query the parquet files in the Synapse workspace using the Linked Service.
How can I run a Synapse notebook and query the Workday ADLS storage area?
filename = 'fact_payroll_timecard/*.parquet'
data_path = 'abfss://%s@%s.dfs.core.windows.net/%s/%s' % (raw_container_name, raw_account_name, rawpath, filename)
dfpt = spark.read.parquet(data_path)
Briefly turning off IP address filtering successfully returned data.
Adding the CallerIpAddress found in the logs to the firewall's IP address allow list worked as well. To get the IP address of the calling notebook, turn on Diagnostic settings for the blob storage and run this query:
StorageBlobLogs
| where TimeGenerated > ago(3d)
| project TimeGenerated, OperationName, StatusText, CallerIpAddress
I don't think this is a long-term solution, as the caller IP address will change.
Upvotes: 0
Views: 287
Reputation: 3170
When a notebook is executed via a pipeline, the workspace managed service identity (MSI) is used for authentication.
Step 1: Ensure the workspace MSI has the necessary permissions on the storage account data. The simplest way to achieve this is to assign the workspace MSI the Storage Blob Data Contributor role on the storage account.
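For illustration, here is a minimal sketch of that role assignment with the Azure SDK for Python (azure-identity plus azure-mgmt-authorization); every ID below is a placeholder, and the flat parameter model assumes a recent SDK version:

import uuid
from azure.identity import DefaultAzureCredential
from azure.mgmt.authorization import AuthorizationManagementClient
from azure.mgmt.authorization.models import RoleAssignmentCreateParameters

subscription_id = "<subscription-id>"  # placeholder
# Scope the assignment to the storage account itself
scope = (
    f"/subscriptions/{subscription_id}/resourceGroups/<rg>"
    "/providers/Microsoft.Storage/storageAccounts/<storage-account>"
)
# Built-in role definition GUID for Storage Blob Data Contributor
role_definition_id = (
    f"/subscriptions/{subscription_id}/providers/Microsoft.Authorization"
    "/roleDefinitions/ba92f5b4-2d11-453d-a403-e96b0029c9fe"
)

client = AuthorizationManagementClient(DefaultAzureCredential(), subscription_id)
client.role_assignments.create(
    scope,
    str(uuid.uuid4()),  # each role assignment needs a fresh GUID name
    RoleAssignmentCreateParameters(
        role_definition_id=role_definition_id,
        principal_id="<workspace-msi-object-id>",  # placeholder: the Synapse workspace MSI
        principal_type="ServicePrincipal",
    ),
)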
Step 2: If the firewall is enabled on the storage account, follow these instructions: Configure Azure Storage firewalls and virtual networks
Here is an example where the firewall is enabled on the storage account. When you grant access to trusted Azure services in the storage account's networking settings (the exception "Allow Azure services on the trusted services list to access this storage account"), or add a resource instance rule for the Synapse workspace, the workspace can reach the storage account through the firewall.
Learn more: Connect to a secure storage account from your Azure Synapse workspace – Azure Synapse Analytics
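As an illustration only, the trusted-services exception and a resource instance rule for the Synapse workspace can also be set programmatically. This sketch assumes the azure-mgmt-storage SDK and uses placeholder IDs throughout; note that it replaces any existing network rules on the account:

from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient
from azure.mgmt.storage.models import (
    NetworkRuleSet,
    ResourceAccessRule,
    StorageAccountUpdateParameters,
)

client = StorageManagementClient(DefaultAzureCredential(), "<subscription-id>")
synapse_workspace_id = (
    "/subscriptions/<subscription-id>/resourceGroups/<rg>"
    "/providers/Microsoft.Synapse/workspaces/<workspace>"
)
client.storage_accounts.update(
    "<rg>",
    "<storage-account>",
    StorageAccountUpdateParameters(
        network_rule_set=NetworkRuleSet(
            default_action="Deny",   # keep the firewall on
            bypass="AzureServices",  # trusted Azure services exception
            resource_access_rules=[
                ResourceAccessRule(  # resource instance rule for the workspace
                    tenant_id="<tenant-id>",
                    resource_id=synapse_workspace_id,
                )
            ],
        )
    ),
)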
Step 3: Configure the Linked Service
Open Synapse Studio and set up the Linked Service to authenticate with the workspace MSI (authentication type "System Assigned Managed Identity").
Step 4: Update the Notebook Code to Use the Linked Service Configuration
%%spark
// Replace with your linked service name
val linked_service_name = "LinkedServerName"

// Route storage authentication through the linked service's token provider
spark.conf.set("spark.storage.synapse.linkedServiceName", linked_service_name)
spark.conf.set("fs.azure.account.oauth.provider.type", "com.microsoft.azure.synapse.tokenlibrary.LinkedServiceBasedTokenProvider")

// Replace the container and storage account names
val remote_path = "abfss://[email protected]/"
println("Remote blob path: " + remote_path)
mssparkutils.fs.ls(remote_path)
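If your notebook uses PySpark, as in the question, the same two Spark settings apply. Here is a minimal sketch, assuming the linked service is named "LinkedServerName" and reusing the path variables from the question:

%%pyspark
# Authenticate to storage through the linked service's token provider
spark.conf.set("spark.storage.synapse.linkedServiceName", "LinkedServerName")
spark.conf.set("fs.azure.account.oauth.provider.type", "com.microsoft.azure.synapse.tokenlibrary.LinkedServiceBasedTokenProvider")

# Same read as in the question, now authorized via the workspace MSI
filename = 'fact_payroll_timecard/*.parquet'
data_path = 'abfss://%s@%s.dfs.core.windows.net/%s/%s' % (raw_container_name, raw_account_name, rawpath, filename)
dfpt = spark.read.parquet(data_path)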
Reference: Using the workspace MSI to authenticate a Synapse notebook when accessing an Azure Storage account
Upvotes: 0