How to read csv with second line as header in pyspark dataframe

Question

I am trying to load a CSV file and use its second line as the header. How can I achieve this? Thanks.

file_location = "/mnt/test/raw/data.csv"
file_type = "csv"    

infer_schema = "true"
delimiter = ","

data = spark.read.format(file_type) \
  .option("inferSchema", infer_schema) \
  .option("header", "false") \
  .option("sep", delimiter) \
  .load(file_location)

Girish Iyer · Accepted Answer

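First read the data as an RDD, drop the first line, and then pass the RDD to spark.read.csv() with header=True, so that the second line of the file is used as the header.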

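# Read the file as an RDD of lines (sc is the SparkContext)
data = sc.textFile('/mnt/test/raw/data.csv')
firstRow = data.first()
# Drop the first line; note this also drops any later line identical to it
data = data.filter(lambda row: row != firstRow)
# The original second line now comes first, so it is used as the header
df = spark.read.csv(data, header=True)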
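
Filtering by value removes every line that equals the first line, not only the first one. If that matters for your data, a possible variation is to drop the line by position instead. The following is only a minimal sketch, assuming a single input file and an existing SparkSession named spark; the helper names drop_first_line and read_csv_second_line_as_header are made up for illustration and are not part of PySpark.

```python
def drop_first_line(lines_rdd):
    """Return an RDD without the line at index 0 (hypothetical helper)."""
    return (
        lines_rdd.zipWithIndex()                # pairs of (line, index)
        .filter(lambda pair: pair[1] > 0)       # keep everything after line 0
        .map(lambda pair: pair[0])              # back to plain lines
    )


def read_csv_second_line_as_header(spark, path, sep=","):
    """Load a CSV whose second line holds the column names (hypothetical helper)."""
    lines = spark.sparkContext.textFile(path)
    # spark.read.csv accepts an RDD of strings as well as a path
    return spark.read.csv(
        drop_first_line(lines), header=True, inferSchema=True, sep=sep
    )


# Usage, assuming `spark` is an active SparkSession:
# df = read_csv_second_line_as_header(spark, "/mnt/test/raw/data.csv")
```

With several input files, only the very first line of the combined RDD is dropped, so this sketch is meant for the single-file case in the question.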

For reference on the DataFrame functions, use the link below. It should serve as a go-to reference for all of the DataFrame operations you need. For a specific version of Spark, replace "latest" in the URL with the version you want:

https://spark.apache.org/docs/latest/api/python/pyspark.sql.html
