Reputation: 25
I'm trying to write some binary data to a file directly in ADLS from Databricks. Basically, I'm fetching the content of a docx file from Salesforce and want to store that content in ADLS. I'm using PySpark.
Here is my first try:
file_path = "adl://<something>.azuredatalakestore.net/<...folders...>/Report.docx"
data = request.content # fetched binary data
with open(file_path, "wb") as file:
    file.write(data)
And the error I get is:
with open(file_path, "wb") as file:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory:
'adl://<something>.azuredatalakestore.net/<...folders...>/Report.docx'
Now, the second try:
file_path = "adl://<something>.azuredatalakestore.net/<...folders...>/Report.docx"
data = request.content
dbutils.fs.put(file_path, data, True)
Again, an error:
dbutils.fs.put(file_path, data, True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: b'PK\x03\x04\x14\x00\x06\x00\x08\x00...
But when I try to write a normal unicode string using dbutils.fs.put(), it works fine.
dbutils.fs.put(file_path, "abcd", True)
# adl://<something>.azuredatalakestore.net/<...folders...>/Report.docx
# Wrote 4 bytes.
I've also tried base64, but I'm not getting the desired result.
dbutils.fs.put(file_path, base64.b64encode(data).decode('utf-8'), True)
It saves the file, but the file ends up unreadable.
Can anyone please help me complete this task?
Upvotes: 1
Views: 3111
Reputation: 87259
Just use dbutils.fs commands (doc) to copy a file written to the local disk into ADLS, with something like this:
file_path = "adl://<something>.azuredatalakestore.net/<...folders...>/Report.docx"
data = request.content # fetched binary data
tmp_path = "/tmp/tmp.file"
with open(tmp_path, "wb") as file:
    file.write(data)
dbutils.fs.cp(f"file:{tmp_path}", file_path)
The last command (or dbutils.fs.mv) will use the configured credentials to access ADLS.
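A small variation on the same idea, in case several tasks write at the same time: using the tempfile module for a uniquely named local file and dbutils.fs.mv so the local copy is removed after the transfer (the ADLS path here is just the placeholder from the question):
import tempfile

file_path = "adl://<something>.azuredatalakestore.net/<...folders...>/Report.docx"
data = request.content  # fetched binary data

# Write to a uniquely named local file first.
with tempfile.NamedTemporaryFile(suffix=".docx", delete=False) as tmp:
    tmp.write(data)
    tmp_path = tmp.name

# Move it into ADLS using the configured credentials.
dbutils.fs.mv(f"file:{tmp_path}", file_path)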
Upvotes: 0
Reputation: 3250
You need to create an Azure Data Lake Storage Gen2 account and a container.
Note down the account name, container name, and account key.
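If you prefer to keep the account key out of the notebook, a minimal sketch using a Databricks secret scope (the scope and key names here are hypothetical placeholders):
storage_account_name = "<storage-account-name>"
container_name = "<container-name>"
# Reads the key from a pre-created secret scope; "adls-scope" and "account-key" are placeholders.
storage_account_key = dbutils.secrets.get(scope="adls-scope", key="account-key")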
Mount the container in Databricks using the mount script:
dbutils.fs.mount(
    source = "wasbs://<container-name>@<storage-account-name>.blob.core.windows.net/",
    mount_point = "/mnt/io243",
    extra_configs = {"fs.azure.account.key.<storage-account-name>.blob.core.windows.net": "<storage-account-Access key>"})
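Mounting a path that is already mounted raises an error, so it can help to check the existing mounts first with dbutils.fs.mounts() before running the script above:
mount_point = "/mnt/io243"
# Only mount if the mount point does not already exist.
if any(m.mountPoint == mount_point for m in dbutils.fs.mounts()):
    print(f"{mount_point} is already mounted")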
Install the azure-storage-file-datalake package on the Databricks cluster. You can run the following command:
%pip install azure-storage-file-datalake
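Since the original question is about writing binary data, this package can also upload the bytes directly to an ADLS Gen2 container without a mount. A minimal sketch, assuming account-key authentication and the placeholder account, container, and path names used above:
from azure.storage.filedatalake import DataLakeServiceClient

# Placeholder names; replace with your own account, key, container and path.
service_client = DataLakeServiceClient(
    account_url="https://<storage-account-name>.dfs.core.windows.net",
    credential="<storage-account-Access key>")
file_client = service_client.get_file_system_client("<container-name>") \
    .get_file_client("<...folders...>/Report.docx")

# request.content is the binary docx payload fetched from Salesforce.
file_client.upload_data(request.content, overwrite=True)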
Use the command below to list the files under the mount point.
dbutils.fs.ls("/mnt/io243")
Read the file from its /dbfs path in binary format:
docx_file_path = "/dbfs/mnt/io243/docx.docx"
with open(docx_file_path, "rb") as f:
    binary_data = f.read()
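The same /dbfs local path also works for writing, which is what the question asks for. A minimal sketch with a placeholder target path under the mount:
# Placeholder target path under the mount point.
output_path = "/dbfs/mnt/io243/Report.docx"

# request.content is the binary docx payload fetched from Salesforce.
with open(output_path, "wb") as f:
    f.write(request.content)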
Write the binary data to a DataFrame:
from pyspark.sql.types import StructType, StructField, BinaryType
schema = StructType([StructField("data", BinaryType())])
df = spark.createDataFrame([(binary_data,)], schema=schema)
Display the binary data:
df.display()
Upvotes: 0