Reputation: 493
I'm new to pyspark & working in pyspark 3.0.1 on data bricks. I have two pyspark data frame,df1 & df2, which I need to save in two sheet of an excel file in ADLS gen2. So I wrote the following script
writer = pd.ExcelWriter(path+"multipleSheet.xlsx", engine='xlsxwriter')
df1.toPandas().to_excel(writer, sheet_name='df1')
df2.toPandas().to_excel(writer, sheet_name='df2')
writer.save()
where path is my ADLS path. Unfortunately I'm not finding error while running the code & also not seeing the excel file in my ADLS
Am I doing anything wrong? Kindly suggest
Upvotes: 1
Views: 4152
Reputation: 60
There is a trick to write multiple-sheet excel file to DBFS
import pandas as pd
import shutil
import datetime
import os
def save_report(reports: dict, identificador: str):
date = datetime.datetime.now()
fecha_actual, hora_actual = str(date).split(" ")
path_1 = "dbfs:/path/to/dir"
path_2 = path_1.replace("dbfs:", "/dbfs")
dbutils.fs.mkdirs(path_1)
# Partition
parts = {"part1": {}, "part2": {}}
i = 1
for name in reports.keys():
if i <= 250:
parts["part1"][name] = reports[name]
elif 251 <= i <= 500:
parts["part2"][name] = reports[name]
i += 1
# file path
filename = os.path.join(path_2, f"{identificador}_{fecha_actual}.xlsx")
for pt in parts:
if parts[pt] == {}:
continue
filename_ = filename.replace(".xlsx", f"_{pt}.xlsx")
tmp_path = f'temp_{datetime.datetime.timestamp(datetime.datetime.now())}.xlsx'
writer = pd.ExcelWriter(tmp_path)
for name in parts[pt].keys():
if not partes[pt][name].empty:
parts[pt][name].to_excel(writer, name[0:30], index=False, engine='xlsxwriter', header=False)
writer.save()
shutil.copy(tmp_path, filename_)
cluster_owner_user_id = spark.conf.get('spark.databricks.clusterUsageTags.orgId')
workspace_url = spark.conf.get('spark.databricks.workspaceUrl')
url_ = "https://" + workspace_url + "/files/" + filename_.replace("/dbfs/FileStore/", "") + "?o=" + cluster_owner_user_id
print("Link to download:\n")
print(url_)
return filename_
To use the function:
reports = {"sheet1": df_1, "sheet2": df_2}
save_report(reports, "id_file")
Upvotes: 0
Reputation: 17794
If you want to save multiple sheets in one file you can use:
import pandas as pd
with pd.ExcelWriter('output.xlsx', engine='xlsxwriter') as writer:
df1.toPandas().to_excel(writer, sheet_name='Sheet_name_1')
df2.toPandas().to_excel(writer, sheet_name='Sheet_name_2')
You can also try the default write engine openpyxl
(need to be installed) instead of xlsxwriter
.
Upvotes: 1