Reputation: 5914
I am trying to read a file mydir/mycsv.csv from Azure Data Lake Storage Gen1 in a Databricks notebook, using syntax inspired by the documentation:
# OAuth2 service principal credentials for ADLS Gen1
configs = {"dfs.adls.oauth2.access.token.provider.type": "ClientCredential",
           "dfs.adls.oauth2.client.id": "123abc-1e42-31415-9265-12345678",
           "dfs.adls.oauth2.credential": dbutils.secrets.get(scope = "adla", key = "adlamaywork"),
           "dfs.adls.oauth2.refresh.url": "https://login.microsoftonline.com/123456abc-2718-aaaa-9999-42424242abc/oauth2/token"}

# Mount the ADLS directory under DBFS
dbutils.fs.mount(
    source = "adl://myadls.azuredatalakestore.net/mydir",
    mount_point = "/mnt/adls",
    extra_configs = configs)

# Read the CSV, keep the first 10 rows, write them out via the local DBFS path
post_processed = spark.read.csv("/mnt/adls/mycsv.csv")
post_processed.limit(10).toPandas().to_csv("/dbfs/processed.csv")

dbutils.fs.unmount("/mnt/adls")
My client 123abc-1e42-31415-9265-12345678 has access to the Data Lake Storage account myadls, and I have created the secret with

databricks secrets put --scope adla --key adlamaywork
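As a sanity check (a minimal sketch, assuming the CLI and the workspace point at the same instance), the scope and key can be listed from the notebook:

dbutils.secrets.listScopes()   # the scope "adla" should appear here
dbutils.secrets.list("adla")   # the key "adlamaywork" should appear here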
When I execute the PySpark code above in the Databricks notebook, accessing the CSV file with spark.read.csv fails with

com.microsoft.azure.datalake.store.ADLException: Error getting info for file /mydir/mycsv.csv
When navigating DBFS with dbfs ls dbfs:/mnt/adls, the parent mount point seems to be there, but I get

Error: b'{"error_code":"IO_ERROR","message":"Error fetching access token\nLast encountered exception thrown after 1 tries [HTTP0(null)]"}'
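For completeness, the mount can also be inspected from inside the notebook (a minimal sketch; dbutils.fs.mounts() lists every active mount point and its backing source):

# Print the mount point and backing source for /mnt/adls, if it exists
for m in dbutils.fs.mounts():
    if m.mountPoint == "/mnt/adls":
        print(m.mountPoint, "->", m.source)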
What am I doing wrong?
Upvotes: 3
Views: 1967
Reputation: 11
If you do not necessarily need to mount the directory into DBFS, you can try reading directly from ADLS, like this:
# Set the ADLS Gen1 OAuth2 credentials on the Spark session
spark.conf.set("dfs.adls.oauth2.access.token.provider.type", "ClientCredential")
spark.conf.set("dfs.adls.oauth2.access.token.provider", "org.apache.hadoop.fs.adls.oauth2.ConfCredentialBasedAccessTokenProvider")
spark.conf.set("dfs.adls.oauth2.client.id", "123abc-1e42-31415-9265-12345678")
spark.conf.set("dfs.adls.oauth2.credential", dbutils.secrets.get(scope = "adla", key = "adlamaywork"))
spark.conf.set("dfs.adls.oauth2.refresh.url", "https://login.microsoftonline.com/123456abc-2718-aaaa-9999-42424242abc/oauth2/token")

# Read the file through the adl:// URI instead of a mount point
csvFile = "adl://myadls.azuredatalakestore.net/mydir/mycsv.csv"
df = spark.read.format('csv').options(header='true', inferSchema='true').load(csvFile)
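With the DataFrame loaded this way, the post-processing from the question works without any mount, e.g. (a minimal sketch; the output path is the one from the question):

# Keep the first 10 rows and write them through the local DBFS path
df.limit(10).toPandas().to_csv("/dbfs/processed.csv", index=False)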
Upvotes: 1