Lambo

Reputation: 1174

Load Large Excel Files in Databricks using PySpark from an ADLS mount

We are trying to load a largish Excel file from a mounted Azure Data Lake location using PySpark on Databricks.

We have tried loading it with both pyspark.pandas and spark-excel, without much success.

pyspark.pandas

import pyspark.pandas as ps
df = ps.read_excel("dbfs:/mnt/aadata/ds/data/test.xlsx",engine="openpyxl")

We get the following conversion error:

ArrowTypeError: Expected bytes, got a 'int' object
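This error usually means a column contains mixed types (for example, numbers and text in the same column), which Arrow cannot convert. One workaround we could try (a sketch; it assumes the data can tolerate being read as text and cast back afterwards) is to force every column to string via the dtype argument:

import pyspark.pandas as ps

# Read every column as a string to side-step Arrow's mixed-type
# conversion error; cast columns to their intended types afterwards.
df = ps.read_excel(
    "dbfs:/mnt/aadata/ds/data/test.xlsx",
    engine="openpyxl",
    dtype=str,
)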

spark-excel

df = spark.read.format("com.crealytics.spark.excel") \
    .option("header", "true") \
    .option("inferSchema", "false") \
    .load("dbfs:/mnt/aadata/ds/data/test.xlsx")

We are able to load a smaller file, but a larger file gives the following error:

org.apache.poi.util.RecordFormatException: Tried to allocate an array of length 185,568,653, but the maximum length for this record type is 100,000,000.
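One possible workaround (a sketch, assuming a recent spark-excel release that exposes these options: maxRowsInMemory enables the streaming reader, and maxByteArraySize raises the POI allocation cap named in the error) would be:

# Stream rows instead of materializing the whole sheet, and override
# POI's 100,000,000-byte array limit from the error above.
df = spark.read.format("com.crealytics.spark.excel") \
    .option("header", "true") \
    .option("inferSchema", "false") \
    .option("maxRowsInMemory", 1000) \
    .option("maxByteArraySize", 2147483647) \
    .load("dbfs:/mnt/aadata/ds/data/test.xlsx")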

Is there any other way to load Excel files in Databricks with PySpark?

Upvotes: 0

Views: 1425

Answers (1)

ASH

Reputation: 20342

Your Excel file probably contains some unusual formatting or a special character that is preventing it from loading. Save the Excel file as a CSV and retry. A CSV is plain text and should load easily, whereas an Excel workbook has all kinds of formatting and metadata embedded in it.
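A minimal sketch of that CSV route (assuming pandas and openpyxl are available on the driver, and that the workbook fits in driver memory; the /dbfs/... paths are the local FUSE view of the question's mount):

import pandas as pd

# Convert the workbook to CSV on the driver via the /dbfs FUSE mount...
pd.read_excel("/dbfs/mnt/aadata/ds/data/test.xlsx", engine="openpyxl") \
    .to_csv("/dbfs/mnt/aadata/ds/data/test.csv", index=False)

# ...then load the CSV with Spark as usual.
df = spark.read.format("csv") \
    .option("header", "true") \
    .load("dbfs:/mnt/aadata/ds/data/test.csv")

Note that the conversion itself still runs single-node on the driver, so this moves the memory pressure rather than removing it.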

Upvotes: 1
