JDoe
JDoe

Reputation: 493

How do I save excel file with multiple sheets from pyspark data frame

I'm new to pyspark & working in pyspark 3.0.1 on data bricks. I have two pyspark data frame,df1 & df2, which I need to save in two sheet of an excel file in ADLS gen2. So I wrote the following script

writer = pd.ExcelWriter(path+"multipleSheet.xlsx", engine='xlsxwriter')
df1.toPandas().to_excel(writer, sheet_name='df1')
df2.toPandas().to_excel(writer, sheet_name='df2')
writer.save()

where path is my ADLS path. Unfortunately I'm not finding error while running the code & also not seeing the excel file in my ADLS

Am I doing anything wrong? Kindly suggest

Upvotes: 1

Views: 4152

Answers (2)

Ivannpy
Ivannpy

Reputation: 60

There is a trick to write multiple-sheet excel file to DBFS

import pandas as pd
import shutil
import datetime
import os


def save_report(reports: dict, identificador: str):
  date = datetime.datetime.now()
  fecha_actual, hora_actual = str(date).split(" ")

  path_1 = "dbfs:/path/to/dir"
  path_2 = path_1.replace("dbfs:", "/dbfs")
  dbutils.fs.mkdirs(path_1)

  # Partition
  parts = {"part1": {}, "part2": {}}
  i = 1
  for name in reports.keys():
    if i <= 250:
      parts["part1"][name] = reports[name]
    elif 251 <= i <= 500:
      parts["part2"][name] = reports[name]
    i += 1

  # file path
  filename = os.path.join(path_2, f"{identificador}_{fecha_actual}.xlsx")


  for pt in parts:
    if parts[pt] == {}:
      continue
    filename_ = filename.replace(".xlsx", f"_{pt}.xlsx")
    tmp_path = f'temp_{datetime.datetime.timestamp(datetime.datetime.now())}.xlsx'
    writer = pd.ExcelWriter(tmp_path)

    for name in parts[pt].keys():
      if not partes[pt][name].empty:
        parts[pt][name].to_excel(writer, name[0:30], index=False, engine='xlsxwriter', header=False)

    writer.save()
    shutil.copy(tmp_path, filename_)

    cluster_owner_user_id = spark.conf.get('spark.databricks.clusterUsageTags.orgId')
    workspace_url = spark.conf.get('spark.databricks.workspaceUrl')
    url_ = "https://" + workspace_url + "/files/" + filename_.replace("/dbfs/FileStore/", "") + "?o=" + cluster_owner_user_id

    print("Link to download:\n")
    print(url_)

    return filename_

To use the function:

reports = {"sheet1": df_1, "sheet2": df_2}
save_report(reports, "id_file")

Upvotes: 0

Mykola Zotko
Mykola Zotko

Reputation: 17794

If you want to save multiple sheets in one file you can use:

import pandas as pd

with pd.ExcelWriter('output.xlsx', engine='xlsxwriter') as writer:
    df1.toPandas().to_excel(writer, sheet_name='Sheet_name_1')
    df2.toPandas().to_excel(writer, sheet_name='Sheet_name_2')

You can also try the default write engine openpyxl (need to be installed) instead of xlsxwriter.

Upvotes: 1

Related Questions