Reading excel files in pyspark with 3rd row as header

Question

I want to read read excel files as spark dataframe with 3rd row as a header.The synatax to read excel files as spark dataframe with 1st row as header is :

s_df  = spark.read.format("com.crealytics.spark.excel") \
                           .option("header", "true") \
                           .option("inferSchema", "true") \
                           .load(path + 'Sales.xlsx')

and the equivalent syntax to read as pandas dataframe with 3rd row as header is :

p_df = pd.read_excel(path + 'Sales.xlsx',header=3)

I want to do the same thing in pyspark that is to read excel files as spark dataframe with 3rd row as header.

SA2010 · Accepted Answer

Use the dataAddress option to specify the cell/row where the data is located . As you need to skip two rows, your data (including header) starts from row A3.

s_df = spark.read.format("com.crealytics.spark.excel") \
           .option("header", "true") \
           .option("inferSchema","true") \
           .option("dataAddress", "'Sheet1'!A3") \
           .load("yourfilepath")

Also, note that if your first two rows are empty, then dataAddress does not have to specified. The leading null rows will be skipped by default.

Check the documentation here

Reading excel files in pyspark with 3rd row as header

Answers (1)

Related Questions