Marc

Reputation: 123

PySpark with DBUtils

I am trying to use DBUtils and PySpark from a Jupyter notebook Python script (running in Docker) to access an Azure Data Lake blob. However, I can't seem to get dbutils to be recognized (i.e. NameError: name 'dbutils' is not defined). I've tried explicitly importing DBUtils, as well as not importing it, as I read:

"An important point to remember is to never run import dbutils in your Python script. This command succeeds but clobbers all the commands so nothing works. It is imported by default." Link

I've also tried the solution posted here, but it still threw "KeyError: 'dbutils'"

spark.conf.set("fs.azure.account.key.<storage account>.blob.core.windows.net", "<storage account access key>")
spark.conf.set("fs.azure.createRemoteFileSystemDuringInitialization", "true")
dbutils.fs.ls("abfss://<container>@<storage account>.dfs.core.windows.net/")
spark.conf.set("fs.azure.createRemoteFileSystemDuringInitialization", "false")

Does anyone have a solution to this?

Upvotes: 0

Views: 4128

Answers (1)

Hossein Sarshar

Reputation: 487

dbutils is only supported within Databricks. To access blob storage from non-Databricks Spark environments, such as a VM on Azure or an HDInsight Spark cluster, you need to modify the core-site.xml file. Here is a quick guide for a stand-alone Spark environment.
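For reference, a minimal sketch of that approach from a stand-alone PySpark session, assuming the hadoop-azure and azure-storage jars are on the classpath. The account, container, and key below are placeholders; the property set here is the same one a core-site.xml entry would carry, and the Hadoop FileSystem API stands in for dbutils.fs.ls:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("blob-access").getOrCreate()
sc = spark.sparkContext

# Same setting that a core-site.xml <property> entry would carry
conf = sc._jsc.hadoopConfiguration()
conf.set(
    "fs.azure.account.key.mystorageaccount.blob.core.windows.net",
    "MY_STORAGE_ACCOUNT_ACCESS_KEY")

# Stand-in for dbutils.fs.ls: list the container via the Hadoop FileSystem API
Path = sc._jvm.org.apache.hadoop.fs.Path
root = Path("wasbs://mycontainer@mystorageaccount.blob.core.windows.net/")
fs = root.getFileSystem(conf)
for status in fs.listStatus(root):
    print(status.getPath())

Note that wasbs:// against the blob.core.windows.net endpoint matches the account-key property above; abfss:// (Data Lake Storage Gen2) goes through the dfs.core.windows.net endpoint and the ABFS driver, which needs its own configuration.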

Upvotes: 2
