Reputation: 354
How can I read the xlsx file format in an Azure Databricks notebook with PySpark? We tried the code below but are getting an error.
import pandas as pd
spark.createDataFrame(pd.read_excel("/Volumes/test/vls/data/empty data.xlsx"))
Is it possible to read the xlsx format without an external library?

Error: PySparkTypeError: Exception thrown when converting pandas.Series (object) with name
Upvotes: -1
Views: 95
Reputation: 7985
Below are possible approaches without installing an external library.

Use pandas and create a Spark DataFrame. Below is the sample code.
import pandas as pd
spark.createDataFrame(pd.read_excel("path_to_excel_file/sample5000.xlsx")).display()
Output:
| Unnamed: 0 | First Name | Last Name | Gender | Country | Age | Date | Id |
|---|---|---|---|---|---|---|---|
| 1 | Dulce | Abril | Female | United States | 32 | 15/10/2017 | 1562 |
| 2 | Mara | Hashimoto | Female | Great Britain | 25 | 16/08/2016 | 1582 |
| 3 | Philip | Gent | Male | France | 36 | 21/05/2015 | 2587 |
| 4 | Kathleen | Hanner | Female | United States | 25 | 15/10/2017 | 3549 |
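The PySparkTypeError in your question is commonly caused by columns that are entirely empty: pandas reads them as object dtype full of NaN, and Spark cannot infer a type for them. A minimal sketch, assuming the empty columns in your file are the culprit, is to drop them before creating the Spark DataFrame:

import pandas as pd

# Read the workbook with pandas (path taken from the question)
pdf = pd.read_excel("/Volumes/test/vls/data/empty data.xlsx")

# Drop columns that are completely empty so Spark's type inference
# is not fed all-NaN object columns
pdf = pdf.dropna(axis=1, how="all")

spark.createDataFrame(pdf).display()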
or
import pyspark.pandas as ps
ps.read_excel("file:/<file_path>/sample5000.xlsx").display()
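If downstream code expects a regular Spark DataFrame rather than a pandas-on-Spark one, a short follow-up (using the same placeholder path as above) is to convert it with to_spark():

import pyspark.pandas as ps

# Read with the pandas API on Spark, then convert to a plain Spark DataFrame
sdf = ps.read_excel("file:/<file_path>/sample5000.xlsx").to_spark()
sdf.display()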
If you don't want to use pandas, then the only way is to install the com.crealytics.spark.excel library on the cluster.
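For reference, a minimal sketch of that route, assuming the spark-excel library (Maven coordinates such as com.crealytics:spark-excel_2.12:<version>) is already installed on the cluster:

# Assumes the com.crealytics spark-excel library is installed on the cluster;
# the path is the one from the question
df = (spark.read
      .format("com.crealytics.spark.excel")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("/Volumes/test/vls/data/empty data.xlsx"))
df.display()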
Upvotes: 1