user175025

Reputation: 434

How to read an Excel (.xlsx) file into a PySpark DataFrame

I have an Excel (.xlsx) file in the data lake. I need to read that file into a PySpark DataFrame. I do not want to use the pandas library.

I have installed the crealytics library on my Databricks cluster and tried the code below:

dbutils.fs.cp('/path/to/excel/file', '/FileStore/tables/', True)

path = '/dbfs/FileStore/tables/myfile1.xlsx'

excel_df = (spark.read.format("com.crealytics.spark.excel")
            .option("header", "true")
            .option("inferSchema", "true")
            .load("/FileStore/tables/myfile1.xlsx"))

I'm getting the below error:

java.lang.NoSuchMethodError: org.apache.commons.io.IOUtils.byteArray(I)[B

Am I missing anything here, or is there another approach I can try other than pandas? I also need to read multiple sheets from the Excel file. Please suggest.

Upvotes: 1

Views: 6732

Answers (1)

Mohammed Ehtesham

Reputation: 21

I was getting the same error. It turned out the problem was the package version. I installed the newer version, 0.13.8 built for Scala 2.12, and it's working.

path="/mnt/replacemountpointname/path/filename.xlsx"
df = spark.read.format("com.crealytics.spark.excel").options(header='True', inferSchema='True').load(path)
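Since the question also asks about reading multiple sheets: spark-excel loads one sheet per read, selected via the `dataAddress` option, so you can loop over the sheet names and union the results. This is a sketch under assumptions; the sheet names below are placeholders, and the union assumes all sheets share the same columns.

```python
from functools import reduce

path = "/mnt/replacemountpointname/path/filename.xlsx"
sheet_names = ["Sheet1", "Sheet2"]  # hypothetical -- replace with your actual sheet names

# Read each sheet separately; "'<sheet>'!A1" points spark-excel at the
# top-left cell of that sheet and reads to the end of its data.
dfs = [
    (spark.read.format("com.crealytics.spark.excel")
          .option("header", "true")
          .option("inferSchema", "true")
          .option("dataAddress", f"'{name}'!A1")
          .load(path))
    for name in sheet_names
]

# Combine into one DataFrame (requires identical columns across sheets).
excel_df = reduce(lambda a, b: a.unionByName(b), dfs)
```

If the sheets have different schemas, keep them as separate DataFrames instead of unioning.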

Link for ref: https://www.youtube.com/watch?v=ib8Zch_4744

Upvotes: 2
