Reputation: 523
I have a DataFrame with values such as:
+------+------+
| col1 | col2 |
+------+------+
| a    | 2    |
| b    | 3    |
| c    | 4    |
| d    | 5    |
+------+------+
I want to create a third column that holds an auto-incrementing value starting from a specified number. Suppose I want the numbering to start from 5; then the expected output should be:
+------+------+------+
| col1 | col2 | col3 |
+------+------+------+
| a    | 2    | 5    |
| b    | 3    | 6    |
| c    | 4    | 7    |
| d    | 5    | 8    |
+------+------+------+
I have tried monotonically_increasing_id(), but it does not produce sequential numbers because the IDs depend on partitioning, and it has no option to start at a specified number.
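Roughly what I tried (a minimal sketch; the extra column name is just for illustration):
from pyspark.sql.functions import monotonically_increasing_id
# IDs are unique and increasing, but they jump between partitions
# (e.g. 0, 1, ... then 8589934592, ...), so they are not sequential
df.withColumn("col3", monotonically_increasing_id()).show()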
I am using Spark 2.4.5 with PySpark (Python).
Upvotes: 1
Views: 2976
Reputation: 31490
We can use the window function row_number() with an orderBy() clause (which sorts globally) and add 4 so that the sequence starts from 5.
In PySpark, using row_number():
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

df = spark.createDataFrame([('a', 2), ('b', 3), ('c', 4), ('d', 5)], ['col1', 'col2']).repartition(4)

# order globally by col2; row_number() then yields 1, 2, 3, ...
w = Window.orderBy('col2')
df.withColumn("col3", row_number().over(w) + 4).show()
#+----+----+----+
#|col1|col2|col3|
#+----+----+----+
#| a| 2| 5|
#| b| 3| 6|
#| c| 4| 7|
#| d| 5| 8|
#+----+----+----+
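Note that an orderBy() window without a partitionBy() moves all rows to a single partition to compute row_number() (Spark logs a warning about this). That is fine for small data but can become a bottleneck at scale. If the start value arrives as a variable rather than a hard-coded 4, a minimal sketch (the name start is just an illustration):
start = 5  # hypothetical parameter: the first value of the sequence
df.withColumn("col3", row_number().over(w) + (start - 1)).show()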
In Scala:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

val w = Window.orderBy("col2")
df.withColumn("col3", row_number().over(w) + 4).show()
//+----+----+----+
//|col1|col2|col3|
//+----+----+----+
//| a| 2| 5|
//| b| 3| 6|
//| c| 4| 7|
//| d| 5| 8|
//+----+----+----+
Upvotes: 3