Reputation: 354
How can I read the xlsx file format in an Azure Databricks notebook with PySpark? We tried the code below but are getting an error.
import pandas as pd
spark.createDataFrame(pd.read_excel("/Volumes/test/vls/data/empty data.xlsx"))
Is it possible to read the xlsx format without an external library?

Error: PySparkTypeError: Exception thrown when converting pandas.Series (object) with name
Upvotes: -1
Views: 95
Reputation: 7985
Below are possible approaches without installing an external library.

Use pandas and create a Spark DataFrame. Below is the sample code.
import pandas as pd
spark.createDataFrame(pd.read_excel("path_to_excel_file/sample5000.xlsx")).display()
Output:
| Unnamed: 0 | First Name | Last Name | Gender | Country | Age | Date | Id |
|---|---|---|---|---|---|---|---|
| 1 | Dulce | Abril | Female | United States | 32 | 15/10/2017 | 1562 |
| 2 | Mara | Hashimoto | Female | Great Britain | 25 | 16/08/2016 | 1582 |
| 3 | Philip | Gent | Male | France | 36 | 21/05/2015 | 2587 |
| 4 | Kathleen | Hanner | Female | United States | 25 | 15/10/2017 | 3549 |
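The PySparkTypeError in your question is commonly caused by columns that are entirely empty: pandas reads them as object dtype full of NaN, and Spark cannot infer a type for them. A minimal sketch, assuming the empty columns in your file are the culprit, is to drop them before creating the Spark DataFrame:

import pandas as pd

# Read the workbook with pandas (path taken from the question)
pdf = pd.read_excel("/Volumes/test/vls/data/empty data.xlsx")

# Drop columns that are completely empty so Spark's type inference
# is not fed all-NaN object columns
pdf = pdf.dropna(axis=1, how="all")

spark.createDataFrame(pdf).display()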
or
import pyspark.pandas as ps
ps.read_excel("file:/<file_path>/sample5000.xlsx").display()
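If downstream code expects a regular Spark DataFrame rather than a pandas-on-Spark one, a short follow-up (using the same placeholder path as above) is to convert it with to_spark():

import pyspark.pandas as ps

# Read with the pandas API on Spark, then convert to a plain Spark DataFrame
sdf = ps.read_excel("file:/<file_path>/sample5000.xlsx").to_spark()
sdf.display()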
If you don't want to use pandas, then the only way is to install the com.crealytics.spark.excel library on the cluster.
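For reference, a minimal sketch of that route, assuming the spark-excel library (Maven coordinates such as com.crealytics:spark-excel_2.12:<version>) is already installed on the cluster:

# Assumes the com.crealytics spark-excel library is installed on the cluster;
# the path is the one from the question
df = (spark.read
      .format("com.crealytics.spark.excel")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("/Volumes/test/vls/data/empty data.xlsx"))
df.display()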
Upvotes: 1