Lambo

Reputation: 1174

Load Large Excel Files in Databricks using PySpark from an ADLS mount

We are trying to load a largish Excel file from a mounted Azure Data Lake location using PySpark on Databricks.

We have tried loading it with both pyspark.pandas and spark-excel, without much success.

pyspark.pandas

import pyspark.pandas as ps
df = ps.read_excel("dbfs:/mnt/aadata/ds/data/test.xlsx",engine="openpyxl")

We get the following conversion error:

ArrowTypeError: Expected bytes, got a 'int' object
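This error usually means a column contains mixed types (for example, numbers and text in the same column), which Arrow cannot convert. One workaround we could try (a sketch; it assumes the data can tolerate being read as text and cast back afterwards) is to force every column to string via the dtype argument:

import pyspark.pandas as ps

# Read every column as a string to side-step Arrow's mixed-type
# conversion error; cast columns to their intended types afterwards.
df = ps.read_excel(
    "dbfs:/mnt/aadata/ds/data/test.xlsx",
    engine="openpyxl",
    dtype=str,
)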

spark-excel

df = spark.read.format("com.crealytics.spark.excel") \
    .option("header", "true") \
    .option("inferSchema", "false") \
    .load("dbfs:/mnt/aadata/ds/data/test.xlsx")

We are able to load a smaller file, but a larger file gives the following error:

org.apache.poi.util.RecordFormatException: Tried to allocate an array of length 185,568,653, but the maximum length for this record type is 100,000,000.
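One possible workaround (a sketch, assuming a recent spark-excel release that exposes these options: maxRowsInMemory enables the streaming reader, and maxByteArraySize raises the POI allocation cap named in the error) would be:

# Stream rows instead of materializing the whole sheet, and override
# POI's 100,000,000-byte array limit from the error above.
df = spark.read.format("com.crealytics.spark.excel") \
    .option("header", "true") \
    .option("inferSchema", "false") \
    .option("maxRowsInMemory", 1000) \
    .option("maxByteArraySize", 2147483647) \
    .load("dbfs:/mnt/aadata/ds/data/test.xlsx")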

Is there any other way to load Excel files in Databricks with PySpark?

Upvotes: 0

Views: 1425

Answers (1)

ASH

Reputation: 20342

Your Excel file probably contains some unusual formatting or a special character that is preventing it from loading. Save the Excel file as a CSV and retry. A CSV is plain text and should load easily, whereas an Excel workbook has all kinds of formatting and metadata embedded in it.
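A minimal sketch of that CSV route (assuming pandas and openpyxl are available on the driver, and that the workbook fits in driver memory; the /dbfs/... paths are the local FUSE view of the question's mount):

import pandas as pd

# Convert the workbook to CSV on the driver via the /dbfs FUSE mount...
pd.read_excel("/dbfs/mnt/aadata/ds/data/test.xlsx", engine="openpyxl") \
    .to_csv("/dbfs/mnt/aadata/ds/data/test.csv", index=False)

# ...then load the CSV with Spark as usual.
df = spark.read.format("csv") \
    .option("header", "true") \
    .load("dbfs:/mnt/aadata/ds/data/test.csv")

Note that the conversion itself still runs single-node on the driver, so this moves the memory pressure rather than removing it.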

Upvotes: 1
