Reputation: 171
Can anyone let me know how to read xlsx or xls files as a Spark dataframe, without converting them first?
I have already tried reading with pandas and then converting to a Spark dataframe, but I get the following error:
Error:
Cannot merge type <class 'pyspark.sql.types.DoubleType'> and <class 'pyspark.sql.types.StringType'>
Code:
import pandas
import os
df = pandas.read_excel('/dbfs/FileStore/tables/BSE.xlsx', sheet_name='Sheet1',inferSchema='')
sdf = spark.createDataFrame(df)
Upvotes: 17
Views: 155665
Reputation: 1
After installing com.crealytics:spark-excel_2.12:0.13.5, you can read the file directly:
df = spark.read.format("com.crealytics.spark.excel") \
.option("header", "true") \
.option("inferSchema", "true") \
.option("dataAddress", "<SHEETNAME>!A1") \
.load("FILEPATH")
display(df)
Upvotes: 0
Reputation: 85
A simple one-liner for reading Excel data into a Spark DataFrame is to use the pandas API on Spark to read the data and instantly convert it to a Spark DataFrame. That would look like this:
import pyspark.pandas as ps
spark_df = ps.read_excel('<excel file path>', sheet_name='Sheet1').to_spark()
Upvotes: 0
Reputation: 624
You can read Excel files located in Azure Blob Storage into a PySpark dataframe with the help of a library called spark-excel (also referred to as com.crealytics.spark.excel).
Install the library either using the UI or the Databricks CLI (Cluster settings page > Libraries > Install new. Make sure to choose Maven).
Once the library is installed, you need proper credentials to access Azure Blob Storage. You can provide the access key in Cluster settings page > Advanced options > Spark config.
Example:
spark.hadoop.fs.azure.account.key.<storage-account>.blob.core.windows.net <access key>
Note: If you're the cluster owner, you can provide the access key as a secret instead of plain text, as described in the docs.
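As a minimal sketch of the secret-based approach (the scope and key names below are placeholders you would replace with your own Databricks secret scope), you can set the same Spark config from a notebook instead of the cluster settings page:

```python
# Illustrative only: "my-scope" / "storage-account-key" are placeholder names
# for a secret scope and secret you have created via the Databricks CLI.
spark.conf.set(
    "fs.azure.account.key.<storage-account>.blob.core.windows.net",
    dbutils.secrets.get(scope="my-scope", key="storage-account-key"),
)
```

This keeps the access key out of plain-text cluster configs; `dbutils.secrets.get` is only available inside a Databricks runtime.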
filePath = "wasbs://<container-name>@<storage-account>.blob.core.windows.net/MyFile1.xls"
DF = spark.read.format("excel").option("header", "true").option("inferSchema", "true").load(filePath)
display(DF)
PS: spark.read.format("excel")
is the V2 approach, while spark.read.format("com.crealytics.spark.excel")
is the V1 approach; you can read more here
Upvotes: 2
Reputation: 610
The configuration and code below work for me to read an Excel file into a PySpark dataframe. Prerequisites before executing the Python code:
Install Maven library on your databricks cluster.
Maven library name & version: com.crealytics:spark-excel_2.12:0.13.5
Databricks Runtime: 9.0 (includes Apache Spark 3.1.2, Scala 2.12)
Execute the code below in your Python notebook to load the Excel file into a PySpark dataframe:
sheetAddress = "'<enter sheetname>'!A1"
filePath = "<enter excel file full path>"
df = spark.read.format("com.crealytics.spark.excel") \
.option("header", "true") \
.option("dataAddress", sheetAddress) \
.option("treatEmptyValuesAsNulls", "false") \
.option("inferSchema", "true") \
.load(filePath)
Upvotes: 0
Reputation: 771
You can read an Excel file through Spark's read function. That requires a Spark plugin; to install it on Databricks go to:
clusters > your cluster > libraries > install new > select Maven and in 'Coordinates' paste com.crealytics:spark-excel_2.12:0.13.5
After that, this is how you can read the file:
df = spark.read.format("com.crealytics.spark.excel") \
.option("header", "true") \
.option("inferSchema", "true") \
.option("dataAddress", "'NameOfYourExcelSheet'!A1") \
.load(filePath)
Upvotes: 1
Reputation: 39
Just open the xlsx or xlsm file in pandas, then convert the dataframe to Spark:
import pandas as pd
df = pd.read_excel('file.xlsx', engine='openpyxl')
df = spark_session.createDataFrame(df.astype(str))
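To see why the astype(str) cast matters, here is a minimal sketch (no Excel file involved, just a hand-built frame shaped like the question's data): a column mixing floats and strings gets the pandas dtype object, which Spark cannot map to a single type, while casting everything to str makes the column uniform.

```python
import pandas as pd

# A column mixing a float and a string, like column B in the question.
df = pd.DataFrame({"A": [1.0, 2.0], "B": [2.2, "C"]})
print(df["B"].dtype)        # object: pandas keeps mixed Python objects as-is

df = df.astype(str)         # cast every column to string
print(df["B"].tolist())     # ['2.2', 'C'] - now uniformly strings
```

The trade-off is that all columns become strings, so numeric columns would need to be cast back on the Spark side if you need them as numbers.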
Upvotes: 0
Reputation: 461
Here is a general version, updated as of April 2021, based on the answers of @matkurek and @Peter Pan.
SPARK
You should install on your databricks cluster the following 2 libraries:
Clusters -> select your cluster -> Libraries -> Install New -> Maven -> in Coordinates: com.crealytics:spark-excel_2.12:0.13.5
Clusters -> select your cluster -> Libraries -> Install New -> PyPI-> in Package: xlrd
Then, you will be able to read your excel as follows:
sparkDF = spark.read.format("com.crealytics.spark.excel") \
.option("header", "true") \
.option("inferSchema", "true") \
.option("dataAddress", "'NameOfYourExcelSheet'!A1") \
.load(filePath)
PANDAS
You should install on your databricks cluster the following 2 libraries:
Clusters -> select your cluster -> Libraries -> Install New -> PyPI-> in Package: xlrd
Clusters -> select your cluster -> Libraries -> Install New -> PyPI-> in Package: openpyxl
Then, you will be able to read your excel as follows:
import pandas
pandasDF = pd.read_excel(io = filePath, engine='openpyxl', sheet_name = 'NameOfYourExcelSheet')
Note that you will have two different objects, in the first scenario a Spark Dataframe, in the second a Pandas Dataframe.
Upvotes: 28
Reputation: 81
As mentioned by @matkurek, you can read the Excel file directly. Indeed, this is better practice than going through pandas, which would forfeit the benefits of Spark.
You can run the same code sample as defined above, just adding the required package to the configuration of your SparkSession.
spark = SparkSession.builder \
.master("local") \
.appName("Word Count") \
.config("spark.jars.packages", "com.crealytics:spark-excel_2.11:0.12.2") \
.getOrCreate()
Then, you can read your excel file.
df = spark.read.format("com.crealytics.spark.excel") \
.option("useHeader", "true") \
.option("inferSchema", "true") \
.option("dataAddress", "'NameOfYourExcelSheet'!A1") \
.load("your_file")
Upvotes: 8
Reputation: 24128
There is no data from your Excel file shown in your post, but I reproduced the same issue as yours.
Here is the data of my sample Excel file test.xlsx: there are different data types in column B
, a double value 2.2
and a string value C
.
So if I run the code below,
import pandas
df = pandas.read_excel('test.xlsx', sheet_name='Sheet1',inferSchema='')
sdf = spark.createDataFrame(df)
it will return the same error as yours:
TypeError: field B: Can not merge type <class 'pyspark.sql.types.DoubleType'> and <class 'pyspark.sql.types.StringType'>
If we inspect the dtypes
of the df
columns via df.dtypes
, we will see that the dtype
of column B
is object
: the spark.createDataFrame
function cannot infer the real data type for column B from the data. So to fix it, the solution is to pass a schema that settles the data type of column B, as in the code below.
from pyspark.sql.types import StructType, StructField, DoubleType, StringType
schema = StructType([StructField("A", DoubleType(), True), StructField("B", StringType(), True)])
sdf = spark.createDataFrame(df, schema=schema)
This forces column B to StringType
and resolves the data type conflict.
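Before handing a frame to spark.createDataFrame, you can detect this situation up front by scanning for object-typed columns that hold more than one Python type. This is a small illustrative helper (not part of pandas or Spark), built on plain pandas:

```python
import pandas as pd

def mixed_type_columns(df):
    """Return names of object-dtype columns holding more than one Python type."""
    bad = []
    for col in df.select_dtypes(include="object"):
        types = {type(v) for v in df[col].dropna()}
        if len(types) > 1:
            bad.append(col)
    return bad

# Same shape as the sample data: column B mixes a double and a string.
df = pd.DataFrame({"A": [1.1, 2.2], "B": [2.2, "C"]})
print(mixed_type_columns(df))   # ['B'] - needs an explicit schema or a cast
```

Any column this returns is a candidate for either an explicit StructField in the schema or an astype cast before conversion.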
Upvotes: 7