Reputation: 473
I have a DataFrame and I want to delete the first and second rows. What should I do?
This is my input:
+-----+
|value|
+-----+
| 1|
| 4|
| 3|
| 5|
| 4|
| 18|
+-----+
This is the expected result:
+-----+
|value|
+-----+
| 3|
| 5|
| 4|
| 18|
+-----+
Upvotes: 5
Views: 17744
Reputation: 27373
In my opinion, it does not make sense to speak about a first or second record if you cannot define an ordering of your DataFrame. The order of the records returned by the show statement is arbitrary and depends on the partitioning of your data.
If you do have a column by which you can order your records, you can use window functions. Starting with this DataFrame:
+----+-----+
|year|value|
+----+-----+
|2007| 1|
|2008| 4|
|2009| 3|
|2010| 5|
|2011| 4|
|2012| 18|
+----+-----+
You can do:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
import spark.implicits._

df
  .withColumn("rn", row_number().over(Window.orderBy($"year")))
  .where($"rn" > 2)
  .drop("rn")
  .show
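Note that Window.orderBy without a partitionBy moves all rows into a single partition (Spark logs a warning about this), so this approach can become a bottleneck on large data.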
Upvotes: 5
Reputation: 23109
The simple and easy way is to assign an id to each row and filter on it:
import org.apache.spark.sql.functions.monotonically_increasing_id

val df = Seq(1, 4, 3, 5, 4, 18).toDF("value")
df.withColumn("id", monotonically_increasing_id()).filter($"id" > 1).drop("id")
Edit: since monotonically_increasing_id() does not guarantee consecutive ids, you can use the RDD's zipWithIndex, which assigns consecutive indices 0, 1, 2, ... in partition order, as below:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// zipWithIndex assigns consecutive, ordered indices to every row
val rows = df.rdd.zipWithIndex().map {
  case (row, id) => Row.fromSeq(row.toSeq :+ id)
}
val df1 = spark.createDataFrame(rows, StructType(df.schema.fields :+ StructField("id", LongType, nullable = false)))
df1.filter($"id" > 1).drop("id")
Output:
+-----+
|value|
+-----+
| 3|
| 5|
| 4|
| 18|
+-----+
This approach also lets you drop the nth row of a DataFrame.
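For instance, here is a minimal sketch of that idea as a reusable helper (the name dropNthRow is illustrative, not a library function):

import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Hypothetical helper: remove the row at 0-based position n,
// relying on zipWithIndex for consecutive, ordered indices.
def dropNthRow(df: DataFrame, n: Long): DataFrame = {
  val rows = df.rdd.zipWithIndex().map { case (row, id) => Row.fromSeq(row.toSeq :+ id) }
  val withId = df.sparkSession.createDataFrame(
    rows,
    StructType(df.schema.fields :+ StructField("id", LongType, nullable = false))
  )
  withId.filter(withId("id") =!= n).drop("id")
}

For example, dropNthRow(df, 0) removes the first row.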
Hope this helps!
Upvotes: 0