Reputation: 175
I'm using Databricks with Python on Azure to process my data. The result of this process will be saved as a CSV file on Azure Blob Storage.
But here's the problem: when the result file is larger than about 750 MB, an error occurs.
After some research I learned that I have to increase spark.rpc.message.maxSize, and I did that. The problem is that the maximum value I can set is only 2 GB, and since I'm using Databricks to analyze big data, I expect files much larger than 2 GB.
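For reference, this is roughly what I put in the cluster's Spark config (the value is in MB, and 2047 seems to be the largest value it accepts):
spark.rpc.message.maxSize 2047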
My questions are:
Is 2 GB really the maximum message size supported on Azure Databricks? I searched the official Microsoft documentation but could not find any information about this.
Is there any way to increase the value, or even make it scale with my data?
Here is my Python code for this process.
# mount Azure Blob Storage to my Databricks workspace
dbutils.fs.mount(
    source = "wasbs://mystoragecontainer.blob.core.windows.net",
    mount_point = "/mnt/test3",
    extra_configs = {"fs.azure.account.key.mystoragecontainer.blob.core.windows.net": dbutils.secrets.get(scope = "myapps", key = "myappskey")})

# define the saving process in a function
def save_data(df, savefile):
    # write the dataframe as a single part file into a folder named after the target file
    df.coalesce(1).write.mode("overwrite").options(header="true").format("com.databricks.spark.csv").save(savefile)
    res = savefile.split('/')
    ls_target = savefile.rstrip(res[-1])   # parent folder of savefile
    fileList = dbutils.fs.ls(savefile + "/")
    target_name = ""
    for item in fileList:
        if item.name.endswith("csv"):
            # move the part-xxxxx.csv file up into the parent folder
            filename = item.path
            target_parts = filename.split('/')
            target_name = filename.replace('/' + target_parts[-2] + '/', '/')
            print(target_name)
            dbutils.fs.mv(filename, ls_target)
        else:
            # drop the _SUCCESS / _committed / _started marker files
            filename = item.path
            dbutils.fs.rm(filename, True)
    # remove the now-empty output folder and rename the moved part file to the target name
    dbutils.fs.rm(savefile, True)
    dbutils.fs.mv(target_name, savefile)

# call my save function
save_data(df, "dbfs:/mnt/test3/myfolderpath/japanese2.csv")
Any information would be appreciated.
Best,
Upvotes: 0
Views: 1528
Reputation: 1300
If I understand correctly, you want to merge the distributed CSV generated by:
df.coalesce(1).write.mode("overwrite").options(header="true").format("com.databricks.spark.csv").save(savefile)
I would suggest you try converting it into a pandas dataframe and writing it to a single CSV like below:
# write the dataframe to a single csv file with pandas
df.toPandas().to_csv("/dbfs/mnt/test3/myfolderpath/japanese2.csv")
This should write a single csv containing all the data in your dataframe.
Be careful to use /dbfs/ when using pandas, as it goes through the local file API instead of the DBFS API.
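For illustration, a minimal sketch of the path difference (folder and file names taken from your question):
# DBFS API path, used by dbutils and Spark readers/writers
dbutils.fs.ls("dbfs:/mnt/test3/myfolderpath/")

# local file API path, used by pandas and plain Python I/O (note the /dbfs prefix)
import pandas as pd
pdf = pd.read_csv("/dbfs/mnt/test3/myfolderpath/japanese2.csv")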
Also, this is PySpark, not really Scala.
Upvotes: 1