Lilly

Reputation: 988

How to read csv with second line as header in pyspark dataframe

I am trying to load a CSV file and use its second line as the header. How can I achieve this? Thanks.

file_location = "/mnt/test/raw/data.csv"
file_type = "csv"    

infer_schema = "true"
delimiter = ","

data = spark.read.format(file_type) \
  .option("inferSchema", infer_schema) \
  .option("header", "false") \
  .option("sep", delimiter) \
  .load(file_location)

Upvotes: 3

Views: 7033

Answers (1)

Girish Iyer

Reputation: 128

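First read the file as an RDD of lines, drop the first line, and then pass the RDD to spark.read.csv() with header=True, so that the original second line becomes the header: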

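# Read the file as an RDD of raw text lines (sc is the SparkContext)
data = sc.textFile('/mnt/test/raw/data.csv')
# Drop the first line so the original second line comes first
firstRow = data.first()
data = data.filter(lambda row: row != firstRow)
# spark.read.csv accepts an RDD of CSV strings; its first row becomes the header
df = spark.read.csv(data, header=True)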
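
One caveat: filtering on `row != firstRow` drops every line that is identical to the first one, not only the first line itself. If that matters for your data, a possible alternative is to drop the line by position using `zipWithIndex`. The following is a minimal sketch under that assumption; the helper names `is_not_first_line` and `read_csv_second_line_header` are hypothetical, and `spark` is assumed to be an existing SparkSession.

```python
def is_not_first_line(indexed_line):
    """Return True for every (line, index) pair except the one at index 0."""
    _, index = indexed_line
    return index > 0


def read_csv_second_line_header(spark, path, sep=","):
    """Hypothetical helper: load a CSV whose second line holds the column names.

    `spark` is assumed to be an active SparkSession.
    """
    # RDD of raw text lines
    lines = spark.sparkContext.textFile(path)
    # zipWithIndex pairs each line with its position: (line, 0), (line, 1), ...
    kept = (
        lines.zipWithIndex()
        .filter(is_not_first_line)      # drop only the line at position 0
        .map(lambda pair: pair[0])      # back to plain strings
    )
    # The first remaining line (the original second line) becomes the header
    return spark.read.csv(kept, header=True, inferSchema=True, sep=sep)
```

Usage would look like `df = read_csv_second_line_header(spark, "/mnt/test/raw/data.csv")`. Note that `zipWithIndex` may trigger an extra Spark job when the RDD has more than one partition.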

For reference on DataFrame functions, see the link below. It covers the DataFrame operations you are likely to need. For a specific Spark version, replace "latest" in the URL with that version number:

https://spark.apache.org/docs/latest/api/python/pyspark.sql.html

Upvotes: 3
