Reputation: 1352
I am working in Databricks, where I have a DataFrame:
type(df)
Out: pyspark.sql.dataframe.DataFrame
All I want is to write this complete Spark DataFrame into an Azure Blob Storage.
I found this post, so I tried that code:
# Configure the blob storage account access key globally
spark.conf.set(
    "fs.azure.account.key.%s.blob.core.windows.net" % storage_name,
    sas_key)

output_container_path = "wasbs://%s@%s.blob.core.windows.net" % (output_container_name, storage_name)
output_blob_folder = "%s/wrangled_data_folder" % output_container_path

# Write the DataFrame as a single file to blob storage
(datafiles
    .coalesce(1)
    .write
    .mode("overwrite")
    .option("header", "true")
    .format("com.databricks.spark.csv")
    .save(output_blob_folder))
Running that code leads to the error below. Changing the "csv" part to parquet or other formats fails as well.
org.apache.spark.sql.AnalysisException: CSV data source does not support struct<AccessoryMaterials:string,CommercialOptions:string,DocumentsUsed:array<string>,Enumerations:array<string>,EnvironmentMeasurements:string,Files:array<struct<Value:string,checksum:string,checksumType:string,name:string,size:string>>,GlobalProcesses:string,Printouts:array<string>,Repairs:string,SoftwareCapabilities:string,TestReports:string,endTimestamp:string,name:string,signature:string,signatureMeaning:bigint,startTimestamp:string,status:bigint,workplace:string> data type.;
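If I read the error correctly, the problem is not the blob storage connection itself: CSV is a flat format, so the writer cannot serialize nested struct/array columns such as Files or DocumentsUsed. A minimal, untested sketch of one possible workaround, serializing every nested column to a JSON string with to_json before writing:

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StructType

# Serialize every struct/array column to a JSON string so the flat
# CSV writer can handle it (sketch; the column list comes from df.schema)
flat_df = df.select([
    F.to_json(F.col(field.name)).alias(field.name)
    if isinstance(field.dataType, (ArrayType, StructType))
    else F.col(field.name)
    for field in df.schema.fields
])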
Therefore my question (and my assumption is that this should be easy): how can I write my Spark DataFrame from Databricks to an Azure Blob Storage?
My Azure folder structure is like this:
Account = MainStorage
Container 1 is called "Data" # contains all the input data; irrelevant here, because I have already read it in.
Container 2 is called "Output" # this is where I want to store my Spark DataFrame.
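For reference, with this layout the output path in the snippet above would presumably be built like this (hypothetical values, following the wasbs pattern used in the code):

storage_name = "MainStorage"       # storage account
output_container_name = "Output"   # target container
output_container_path = "wasbs://%s@%s.blob.core.windows.net" % (
    output_container_name, storage_name)
# -> wasbs://Output@MainStorage.blob.core.windows.net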
Many thanks in advance!
EDIT: I am using Python. However, I don't mind if the solution is in another language, as long as Databricks supports it (e.g. R or Scala). If it works, it is perfect :-)
Upvotes: 1
Views: 7012
Reputation: 1258
Assuming you have already mounted the blob storage, use the approach below to write your DataFrame in CSV format. Please note that the newly created file will have a Spark-generated default file name with a csv extension, so you might need to rename it to a consistent name (see the PySpark sketch after the Scala snippet).
// output_container_path = wasbs://ContainerName@StorageAccountName.blob.core.windows.net/DirectoryName
val mount_root = "/mnt/ContainerName/DirectoryName"

df.coalesce(1)
  .write
  .format("csv")
  .option("header", "true")
  .mode("overwrite")
  .save(s"dbfs:$mount_root/")
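Since the question uses Python, a PySpark equivalent of the same approach, including the renaming step mentioned above, might look like the sketch below. Assumptions: the container is mounted at the same mount point, dbutils is available in the notebook, and "wrangled_data.csv" is just an example target name.

mount_root = "/mnt/ContainerName/DirectoryName"

# Write the DataFrame as a single CSV part file to the mounted container
(df.coalesce(1)
   .write
   .format("csv")
   .option("header", "true")
   .mode("overwrite")
   .save("dbfs:%s/" % mount_root))

# Spark names the output part-00000-<uuid>.csv; rename it to a stable name
# ("wrangled_data.csv" is a hypothetical example)
part_file = [f.path for f in dbutils.fs.ls("dbfs:%s/" % mount_root)
             if f.name.startswith("part-") and f.name.endswith(".csv")][0]
dbutils.fs.mv(part_file, "dbfs:%s/wrangled_data.csv" % mount_root)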
Upvotes: 1