Debtanu Gupta

Reputation: 25

How to write a binary file directly from Databricks (PySpark) to Azure DataLake?

I'm trying to write some binary data directly to a file in ADLS from Databricks. Basically, I'm fetching the content of a docx file from Salesforce and want to store it in ADLS. I'm using PySpark.

Here is my first try:

file_path = "adl://<something>.azuredatalakestore.net/<...folders...>/Report.docx"
data = request.content # fetched binary data 

with open(file_path, "wb") as file:
    file.write(data)

And the error I get is:

with open(file_path, "wb") as file:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory:
'adl://<something>.azuredatalakestore.net/<...folders...>/Report.docx'

Now, the second try:

file_path = "adl://<something>.azuredatalakestore.net/<...folders...>/Report.docx"
data = request.content

dbutils.fs.put(file_path, data, True)

Again, an error:

dbutils.fs.put(file_path, data, True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: b'PK\x03\x04\x14\x00\x06\x00\x08\x00...

But when I try to write a normal Unicode string using dbutils.fs.put(), it works fine.

dbutils.fs.put(file_path, "abcd", True)

# adl://<something>.azuredatalakestore.net/<...folders...>/Report.docx
# Wrote 4 bytes.

I've also tried base64, but I'm not getting the desired result.

dbutils.fs.put(file_path, base64.b64encode(data).decode('utf-8'), True)

It saves the file, but the resulting file is unreadable.

Can anyone please help me complete this task?

Upvotes: 1

Views: 3111

Answers (2)

Alex Ott

Reputation: 87259

Just use dbutils.fs commands (doc) to copy a file that you've written to the local disk, with something like this:

file_path = "adl://<something>.azuredatalakestore.net/<...folders...>/Report.docx"
data = request.content # fetched binary data 
tmp_path = "/tmp/tmp.file"

with open(tmp_path, "wb") as file:
    file.write(data)

dbutils.fs.cp(f"file:{tmp_path}", file_path)

The last command (or use dbutils.fs.mv) will use the configured credentials to access ADLS.
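
For a slightly more robust variant of the same approach, you could let Python's tempfile module pick a unique local path instead of hard-coding one. This is only a sketch; the ADLS URL and request object are the placeholders from the question:

import os
import tempfile

file_path = "adl://<something>.azuredatalakestore.net/<...folders...>/Report.docx"
data = request.content  # binary payload fetched from Salesforce

# Write the bytes to a unique temporary file on the driver's local disk
with tempfile.NamedTemporaryFile(delete=False, suffix=".docx") as tmp:
    tmp.write(data)
    tmp_path = tmp.name

# Copy from the local filesystem to ADLS using the cluster's configured credentials
dbutils.fs.cp(f"file:{tmp_path}", file_path)

# Remove the local temporary copy
os.remove(tmp_path)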

Upvotes: 0

You need to create an Azure Data Lake Storage Gen2 account and a container. Note down the account name, container name, and account key.

Mount the ADLS container in Databricks using a mount script:

dbutils.fs.mount(
    source = "wasbs://<container-name>@<storage-account-name>.blob.core.windows.net/",
    mount_point = "/mnt/io243",
    extra_configs = {"fs.azure.account.key.<storage-account-name>.blob.core.windows.net":"<storage-account-Access key>"})


Install the azure-storage-file-datalake package on the Databricks cluster. You can run the following command:

%pip install azure-storage-file-datalake
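
Since the package is installed anyway, note that it can also upload the binary content directly, without going through the mount. A minimal sketch, assuming the account name, key, container name, and target path are placeholders you fill in, and binary_data stands for the raw bytes (e.g. request.content from the question):

from azure.storage.filedatalake import DataLakeServiceClient

account_name = "<storage-account-name>"
account_key = "<storage-account-Access key>"

# Connect to the ADLS Gen2 account using the account key
service_client = DataLakeServiceClient(
    account_url=f"https://{account_name}.dfs.core.windows.net",
    credential=account_key)

# Get a client for the target container and file path
file_system_client = service_client.get_file_system_client(file_system="<container-name>")
file_client = file_system_client.get_file_client("<...folders...>/Report.docx")

# Upload the raw bytes, overwriting any existing file
file_client.upload_data(binary_data, overwrite=True)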


Use the command below to list the files under the mount point:

dbutils.fs.ls("/mnt/io243")


Read the file from the mounted path in binary format:

docx_file_path = "/dbfs/mnt/io243/docx.docx"

with open(docx_file_path, "rb") as f:
  binary_data = f.read()
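
The same /dbfs/ style path also works in the write direction, which is what the question asks for. A minimal sketch, assuming the mount above and that request.content from the question holds the docx bytes fetched from Salesforce:

# Write the fetched bytes straight through the mount into ADLS
with open("/dbfs/mnt/io243/Report.docx", "wb") as f:
    f.write(request.content)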

Write the binary data to a DataFrame:

from pyspark.sql.types import StructType, StructField, BinaryType

schema = StructType([StructField("data", BinaryType())])
df = spark.createDataFrame([(binary_data,)], schema=schema)

Display the binary data:

df.display()
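
As an aside, Spark also has a built-in binaryFile data source that reads files into a DataFrame with path, modificationTime, length, and content columns, which avoids the manual open()/createDataFrame step. A sketch, assuming the same mounted file:

# Load the docx as binary content via Spark's binaryFile data source
df = spark.read.format("binaryFile").load("dbfs:/mnt/io243/docx.docx")
df.select("path", "length", "content").display()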


Upvotes: 0
