Reputation: 1174
We are trying to load a fairly large Excel file from a mounted Azure Data Lake location using PySpark on Databricks.
We have tried both pyspark.pandas and spark-excel to load it, without much success.
import pyspark.pandas as ps
df = ps.read_excel("dbfs:/mnt/aadata/ds/data/test.xlsx", engine="openpyxl")
The pyspark.pandas approach fails with the following conversion error:
ArrowTypeError: Expected bytes, got a 'int' object
With spark-excel, we tried:
df = spark.read.format("com.crealytics.spark.excel") \
    .option("header", "true") \
    .option("inferSchema", "false") \
    .load("dbfs:/mnt/aadata/ds/data/test.xlsx")
This works for a smaller file, but the larger file gives the following error:
org.apache.poi.util.RecordFormatException: Tried to allocate an array of length 185,568,653, but the maximum length for this record type is 100,000,000.
Is there any other way to load Excel files in Databricks with PySpark?
Upvotes: 0
Views: 1425
Reputation: 20342
Your Excel file probably contains some unusual formatting or a special character that is preventing it from being parsed. Save the Excel file as a CSV file and re-try. A CSV file should load without trouble, because it is plain text, whereas an Excel workbook carries all kinds of embedded formatting and metadata.
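For reference, a minimal sketch of loading the re-saved CSV with Spark's built-in CSV reader; the path and option values here are assumptions for illustration, not taken from your setup:
# Read the re-saved CSV with Spark's native CSV source.
# Path and options are illustrative.
df = (
    spark.read.format("csv")
    .option("header", "true")       # first row holds column names
    .option("inferSchema", "true")  # let Spark infer column types
    .load("dbfs:/mnt/aadata/ds/data/test.csv")
)
df.printSchema()
If type inference guesses wrong for some columns, you can instead pass an explicit schema to spark.read.schema(...) before load().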
Upvotes: 1