mytabi

Reputation: 779

databricks: writing spark dataframe directly to excel

Is there any way to write a Spark DataFrame directly to xls/xlsx format?

Most of the examples on the web show how to do this with pandas DataFrames,

but I would like to keep working with a Spark DataFrame. Any ideas?

Upvotes: 9

Views: 49131

Answers (4)

Sander Vanden Hautte

Reputation: 2543

It is possible to generate an Excel file directly from pySpark, without converting to Pandas first:

df_spark.write.format("com.crealytics.spark.excel")\
  .option("header", "true")\
  .mode("overwrite")\
  .save(path)

To run the above code, you first need to install the com.crealytics:spark-excel_2.12:0.13.5 library (or a more recent version, of course). In Azure Databricks, for example, you can add it as a new Maven library in your cluster's libraries list (reachable from the left sidebar of the Databricks UI).

For more info see https://github.com/crealytics/spark-excel.

Upvotes: 13

Jha Ayush

Reputation: 87

You cannot save it directly, but you can write the file to a temporary location first and then move it into your directory. My code piece is:

import pandas as pd1

# Create a Pandas Excel writer using XlsxWriter as the engine.
writer = pd1.ExcelWriter('data_checks_output.xlsx', engine='xlsxwriter')

output = dataset.limit(10)
output = output.toPandas()
output.to_excel(writer, sheet_name='top_rows', startrow=0)

writer.save()

The code below moves the file into place:

%sh
sudo mv data_checks_output.xlsx /dbfs/mnt/fpmount/

Comment if anyone has an update or a better way to do it.

Upvotes: 1

Papa_Helix

Reputation: 716

I'm assuming that because you have the "databricks" tag you want to create an .xlsx file within the Databricks file store, and that you are running code within Databricks notebooks. I'm also going to assume that your notebooks are running Python.

There is no direct way to save an Excel document from a Spark DataFrame. You can, however, convert a Spark DataFrame to a pandas DataFrame and then export from there. We'll need to start by installing the xlsxwriter package. You can do this for your notebook environment using a Databricks utilities command:

dbutils.library.installPyPI('xlsxwriter')
dbutils.library.restartPython()
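On newer Databricks runtimes, where dbutils.library.installPyPI is deprecated, the notebook-scoped %pip magic does the same job (this assumes your runtime is recent enough to support %pip):

```
%pip install xlsxwriter
```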

I was having a few permission issues saving an excel file directly to dbfs. A quick workaround was to save to the cluster's default directory then sudo move the file into dbfs. Here's some example code:

# Creating dummy spark dataframe
spark_df = spark.sql('SELECT * FROM default.test_delta LIMIT 100')

# Converting spark dataframe to pandas dataframe
pandas_df = spark_df.toPandas()

# Exporting pandas dataframe to xlsx file
pandas_df.to_excel('excel_test.xlsx', engine='xlsxwriter')

Then in a new command, specifying the command to run in shell with %sh:

%sh
sudo mv excel_test.xlsx /dbfs/mnt/data/
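To sanity-check the export before moving it into dbfs, the workbook can be read straight back with pandas. Here is a minimal self-contained sketch (dummy data stands in for the Spark query result; it assumes an Excel engine such as xlsxwriter or openpyxl is installed):

```python
import pandas as pd

# Dummy frame standing in for spark_df.toPandas()
pandas_df = pd.DataFrame({"col_a": [1, 2], "col_b": ["x", "y"]})
pandas_df.to_excel("excel_test.xlsx", index=False)

# Read the workbook back to confirm it round-trips intact
check = pd.read_excel("excel_test.xlsx")
print(check.shape)  # (2, 2)
```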

Upvotes: 9

ASH

Reputation: 20302

I believe you can do it like this.

sourcePropertySet.write
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .save("D:\\resultset.csv")

I'm not sure you can write directly to Excel, but Excel can definitely consume a CSV. This is almost certainly the easiest and cleanest way of doing this kind of thing. Excel files carry all kinds of formatting (think of merged cells), which can cause errors when consumed by some systems.
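As a plain-Python illustration of the CSV route (a stdlib-only sketch with made-up rows standing in for the Spark output):

```python
import csv

rows = [["id", "name"], ["1", "alice"], ["2", "bob"]]

# Write a header row plus two data rows; Excel opens this file directly
with open("resultset.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)

# Read it back to show the round-trip
with open("resultset.csv", newline="") as f:
    back = list(csv.reader(f))
print(back[0])  # ['id', 'name']
```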

Upvotes: 0
