mentongwu

Reputation: 473

How to delete the first few rows of a DataFrame in Scala/Spark?

I have a DataFrame and I want to delete the first and second rows. What should I do?

This is my input:

+-----+
|value|
+-----+
|    1|
|    4|
|    3|
|    5|
|    4|
|   18|
+-----+

This is the expected result:

+-----+
|value|
+-----+
|    3|
|    5|
|    4|
|   18|
+-----+

Upvotes: 5

Views: 17744

Answers (2)

Raphael Roth

Reputation: 27373

In my opinion it does not make sense to speak about a first or second record if you cannot define an ordering on your DataFrame. The order in which records appear in the output of show is arbitrary and depends on the partitioning of your data.

If you have a column by which you can order your records, you can use window functions. Starting with this DataFrame:

+----+-----+
|year|value|
+----+-----+
|2007|    1|
|2008|    4|
|2009|    3|
|2010|    5|
|2011|    4|
|2012|   18|
+----+-----+ 

You can do

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

df
  .withColumn("rn", row_number().over(Window.orderBy($"year"))) // number rows by year
  .where($"rn" > 2)  // keep everything after the first two rows
  .drop("rn")
  .show()
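
Given the DataFrame above, this should print:

+----+-----+
|year|value|
+----+-----+
|2009|    3|
|2010|    5|
|2011|    4|
|2012|   18|
+----+-----+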

Upvotes: 5

koiralo

Reputation: 23109

The simple and easy way is to assign an id to each row and filter on it:

import org.apache.spark.sql.functions.monotonically_increasing_id

val df = Seq(1, 4, 3, 5, 4, 18).toDF("value")

df.withColumn("id", monotonically_increasing_id()).filter($"id" > 1).drop("id")

Edit: Since monotonically_increasing_id() only guarantees that the ids are increasing and unique, not consecutive, you can use zipWithIndex, which does produce consecutive indices, as below:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// zipWithIndex assigns consecutive indices 0, 1, 2, ... following partition order
val rows = df.rdd.zipWithIndex().map {
  case (row, id) => Row.fromSeq(row.toSeq :+ id)
}

val df1 = spark.createDataFrame(rows, StructType(df.schema.fields :+ StructField("id", LongType, false)))

df1.filter($"id" > 1).drop("id")

Output:

+-----+
|value|
+-----+
|    3|
|    5|
|    4|
|   18|
+-----+

This approach also lets you drop the nth row of a DataFrame; see the sketch below.
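
For example, a minimal sketch, assuming df1 from above still carries the generated id column (0-based) and n is a hypothetical index you choose:

val n = 2L  // hypothetical: drop the third row (0-based index)
df1.filter($"id" =!= n).drop("id")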

Hope this helps!

Upvotes: 0
