Anonymous

Reputation: 319

How to read an Excel file (.xlsx) using PySpark and store it in a DataFrame?

I have data in an Excel file (.xlsx). How can I read this data and store it in a Spark DataFrame?

Upvotes: 2

Views: 4807

Answers (2)

Mayanglambam Yaiphaba

Reputation: 61

On your Databricks cluster, install the following two libraries:

Clusters -> select your cluster -> Libraries -> Install New -> Maven -> in Coordinates: com.crealytics:spark-excel_2.12:0.13.5

Clusters -> select your cluster -> Libraries -> Install New -> PyPI -> in Package: xlrd

You can then read the Excel file as follows:

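# filePath is the path to the .xlsx file; replace NameOfYourExcelSheet with the sheet to read.
# "header" treats the first row as column names; "inferSchema" lets the reader guess column types.
sparkDF = spark.read.format("com.crealytics.spark.excel") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .option("dataAddress", "'NameOfYourExcelSheet'!A1") \
    .load(filePath)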
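
If you read several sheets, it may help to wrap the call in a small helper. This is only a sketch: the function names `build_data_address` and `read_excel_sheet` are made up for illustration, and it assumes the spark-excel library is already installed on the cluster and that `spark` is an active SparkSession.

```python
def build_data_address(sheet_name, top_left_cell="A1"):
    """Return a dataAddress string in the 'SheetName'!Cell form used above."""
    return f"'{sheet_name}'!{top_left_cell}"


def read_excel_sheet(spark, file_path, sheet_name, top_left_cell="A1"):
    """Read one sheet of an .xlsx file into a Spark DataFrame.

    Hypothetical helper: assumes com.crealytics:spark-excel is installed
    and that `spark` is an active SparkSession.
    """
    return (
        spark.read.format("com.crealytics.spark.excel")
        .option("header", "true")       # first row holds the column names
        .option("inferSchema", "true")  # let the reader guess column types
        .option("dataAddress", build_data_address(sheet_name, top_left_cell))
        .load(file_path)
    )
```

Usage would then look like `sparkDF = read_excel_sheet(spark, filePath, "NameOfYourExcelSheet")`. The parentheses around the chained calls avoid the trailing backslashes, so a missing continuation character cannot break the statement.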

Upvotes: 2

You could use the pandas API on Spark (pyspark.pandas), which has been part of PySpark since Spark 3.2.

Here is the documentation: https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/api/pyspark.pandas.read_excel.html
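
A minimal sketch of that approach, assuming Spark 3.2 or later. The function name `read_excel_with_pandas_api` is hypothetical, and an Excel engine such as openpyxl will probably need to be installed on the cluster for .xlsx files.

```python
def read_excel_with_pandas_api(file_path, sheet_name=0):
    """Read one Excel sheet via pyspark.pandas and return a Spark DataFrame.

    Hypothetical helper: assumes Spark 3.2+ and an installed Excel engine.
    """
    # Imported here so the function can be defined even where PySpark is absent.
    import pyspark.pandas as ps

    # With a single sheet name or index, read_excel returns one
    # pandas-on-Spark DataFrame.
    psdf = ps.read_excel(file_path, sheet_name=sheet_name)

    # Convert to a regular Spark DataFrame.
    return psdf.to_spark()
```

Calling `read_excel_with_pandas_api(filePath)` reads the first sheet; pass a sheet name as the second argument to pick another one.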

Upvotes: 1
