Shobhana Mani

Reputation: 43

Skip the first line of a CSV file in Scala

I want to read and write a CSV file while ignoring its first line, because the header starts on the second line.

val df = df1.withColumn("index", monotonicallyIncreasingId()).filter(col("index") > 1).drop("index")

This does not resolve my issue.
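Outside Spark, the intended behaviour — skip the first physical line and take the header from the second — can be sketched with Python's standard csv module (a minimal illustration, not Spark code; the sample data is made up):

```python
import csv
import io

# sample file contents: a junk first line, then the real header, then data
raw = "aasadasd\nname,age,des\na,2,dd\nb,5,ff\n"

f = io.StringIO(raw)
next(f)  # skip the first line entirely
reader = csv.DictReader(f)  # DictReader now takes its header from the second line
rows = list(reader)
print(rows[0])  # {'name': 'a', 'age': '2', 'des': 'dd'}
```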

Upvotes: 1

Views: 1459

Answers (2)

viapak

Reputation: 91

Here is my attempt, for the case where the header is on the second line of the CSV file and the first line needs to be ignored.

// requires: import org.apache.spark.sql.functions._ and import spark.implicits._
val df = spark.read.text("/yourFilePath/my.csv").withColumn("row_id", monotonically_increasing_id)
// the header is the row with row_id == 1 (the second line of the file)
val cols = df.select("value").filter('row_id === 1).first.mkString.split(",")
// keep only the data rows, split each line on commas, and name the columns from the header
val df2 = df.filter('row_id > 1)
  .withColumn("temp", split(col("value"), ","))
  .select((0 until cols.length).map(i => col("temp").getItem(i).as(cols(i))): _*)

Before

+------------+------+
|       value|row_id|
+------------+------+
|    aasadasd|     0|
|name,age,des|     1|
|      a,2,dd|     2|
|      b,5,ff|     3|
+------------+------+

After

+----+---+---+
|name|age|des|
+----+---+---+
|a   |2  |dd |
|b   |5  |ff |
+----+---+---+
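What the Spark code above computes can be mirrored in plain Python for illustration (an informal sketch, not the Spark API; data taken from the tables above):

```python
# each entry mirrors one row of the "value" column in the "Before" table
lines = ["aasadasd", "name,age,des", "a,2,dd", "b,5,ff"]

# row_id 1 holds the header; rows after it are data
header = lines[1].split(",")
data = [line.split(",") for line in lines[2:]]

# name each field after the corresponding header column
records = [dict(zip(header, row)) for row in data]
print(records)
# [{'name': 'a', 'age': '2', 'des': 'dd'}, {'name': 'b', 'age': '5', 'des': 'ff'}]
```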

Upvotes: 1

Lamanus

Reputation: 13551

Left anti join maybe?

header = df.limit(1)
df.join(header, df.columns, 'left_anti').show()

Or use an RDD filter.

header = df.first()
df.rdd.filter(lambda x: x != header).toDF(df.columns).show()
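In plain Python the RDD filter amounts to dropping every row equal to the first one, which also exposes its caveat: any later duplicate of that first row would be dropped too (illustrative sketch, using the question's sample lines):

```python
rows = ["aasadasd", "name,age,des", "a,2,dd", "b,5,ff"]

first = rows[0]
# keep every row that differs from the first one
filtered = [r for r in rows if r != first]
print(filtered)  # ['name,age,des', 'a,2,dd', 'b,5,ff']
```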

Upvotes: 2
