Reputation: 43
I want to read and write a CSV file while ignoring the first line, because the header starts on the second line.
val df = df1.withColumn("index", monotonicallyIncreasingId()).filter(col("index") > 1).drop("index")
This does not resolve my issue.
Upvotes: 1
Views: 1459
Reputation: 91
Here is my attempt, for the case where the header is on the second line of the CSV file and the first line needs to be ignored.
val df = spark.read.text("/yourFilePath/my.csv")
  .withColumn("row_id", monotonically_increasing_id)
val cols = df.select("value").filter('row_id === 1).first.mkString.split(",")
val df2 = df.filter('row_id > 1)
  .withColumn("temp", split(col("value"), ","))
  .select((0 until cols.length).map(i => col("temp").getItem(i).as(cols(i))): _*)
Before
+------------+------+
| value|row_id|
+------------+------+
| aasadasd| 0|
|name,age,des| 1|
| a,2,dd| 2|
| b,5,ff| 3|
+------------+------+
After
+----+---+---+
|name|age|des|
+----+---+---+
|a |2 |dd |
|b |5 |ff |
+----+---+---+
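The transformation above boils down to: treat the second line as the header and split every later line on commas. As a minimal plain-Python sketch of that same logic (the sample lines are taken from the "Before" table above, not from any real file):

```python
# Skip line 1, use line 2 as the header, split the remaining lines into rows.
lines = ["aasadasd", "name,age,des", "a,2,dd", "b,5,ff"]

cols = lines[1].split(",")  # header is on the second line
rows = [dict(zip(cols, line.split(","))) for line in lines[2:]]

print(cols)  # ['name', 'age', 'des']
print(rows)  # [{'name': 'a', 'age': '2', 'des': 'dd'}, {'name': 'b', 'age': '5', 'des': 'ff'}]
```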
Upvotes: 1
Reputation: 13551
Left anti join maybe?
header = df.limit(1)
df.join(header, df.columns, 'left_anti').show()
Or use an RDD filter:
header = df.first()
df.rdd.filter(lambda x: x != header).toDF(df.columns).show()
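The RDD-filter variant captures the first row and keeps every row that differs from it. A plain-Python sketch of the same pattern (sample lines assumed, not from any real file); note that any other line with identical content would be dropped too:

```python
lines = ["aasadasd", "name,age,des", "a,2,dd", "b,5,ff"]

header = lines[0]  # the unwanted first line, analogous to df.first()
kept = [x for x in lines if x != header]

print(kept)  # ['name,age,des', 'a,2,dd', 'b,5,ff']
```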
Upvotes: 2