Reputation: 187
I want to read read excel files as spark dataframe with 3rd row as a header.The synatax to read excel files as spark dataframe with 1st row as header is :
s_df = spark.read.format("com.crealytics.spark.excel") \
.option("header", "true") \
.option("inferSchema", "true") \
.load(path + 'Sales.xlsx')
and the equivalent syntax to read as pandas dataframe with 3rd row as header is :
p_df = pd.read_excel(path + 'Sales.xlsx',header=3)
I want to do the same thing in pyspark that is to read excel files as spark dataframe with 3rd row as header.
Upvotes: 4
Views: 4447
Reputation: 323
Use the dataAddress option to specify the cell/row where the data is located . As you need to skip two rows, your data (including header) starts from row A3.
s_df = spark.read.format("com.crealytics.spark.excel") \
.option("header", "true") \
.option("inferSchema","true") \
.option("dataAddress", "'Sheet1'!A3") \
.load("yourfilepath")
Also, note that if your first two rows are empty, then dataAddress does not have to specified. The leading null rows will be skipped by default.
Check the documentation here
Upvotes: 3