Can parquet, avro and other hadoop file formats have different layout for first line?

Question

Why do I have to convert an RDD to DF in order to write it as parquet, avro or other types? I know writing RDD as these formats is not supported. I was actually trying to write a parquet file with first line containing only the header date and other lines containing the detail records. A sample file layout

2019-04-06
101,peter,20000
102,robin,25000

I want to create a parquet with the above contents. I already have a csv file sample.csv with above contents. The csv file when read as dataframe contains only the first field as the first row has only one column.

rdd = sc.textFile('hdfs://somepath/sample.csv')
df = rdd.toDF()
df.show()

o/p:

2019-04-06
101
102

Could someone please help me with converting the entire contents of rdd into dataframe. Even when i try reading the file directly as a df instead of converting from rdd same thing happens.

Can parquet, avro and other hadoop file formats have different layout for first line?

Answers (1)

Related Questions