Grevioos

Reputation: 405

Pyspark: auto-increment starting from specific value

I need to add a column to my dataframe that increments by 1, but starting from 500. So the first row would be 500, the second 501, etc. It doesn't make sense to use a UDF, since it can be executed on different workers, and I don't know of any function that takes a starting value as a parameter. I don't have anything I could sort my dataframe on either. Both row number and auto increment start at 1 by default. I believe I can do it by transforming my df to an RDD and back to a df, but that seems like quite an ugly solution. Do you know of any existing function that would help me solve this on the dataframe level?
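
Roughly, the RDD version I would like to avoid looks something like this, using zipWithIndex (the new_id column name and the spark session variable are just for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# zipWithIndex assigns consecutive 0-based indices, so shift them by 500
with_idx = df.rdd.zipWithIndex().map(lambda pair: tuple(pair[0]) + (pair[1] + 500,))
df = spark.createDataFrame(with_idx, df.columns + ["new_id"])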

Thank you!

Upvotes: 3

Views: 5421

Answers (2)

Cena

Reputation: 3419

Since monotonically_increasing_id() isn't consecutive, you can use row_number() ordered by monotonically_increasing_id() and add 499.

from pyspark.sql.window import Window
from pyspark.sql.functions import monotonically_increasing_id, row_number

# unique, increasing (but not consecutive) id per row
df = df.withColumn("idx", monotonically_increasing_id())
# a global ordering on idx makes row_number() consecutive, starting at 1
w = Window().orderBy("idx")
df.withColumn("row_num", (499 + row_number().over(w))).show()
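
A quick usage sketch, assuming a local SparkSession and a toy dataframe (the letter column is just for illustration):

from pyspark.sql import SparkSession
from pyspark.sql.functions import monotonically_increasing_id, row_number
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a",), ("b",), ("c",)], ["letter"])
df = df.withColumn("idx", monotonically_increasing_id())
w = Window().orderBy("idx")
# row_num comes out as 500, 501, 502 in the original row order
df.withColumn("row_num", 499 + row_number().over(w)).show()

One design note: a Window with orderBy but no partitionBy moves all rows to a single partition, so Spark logs a performance warning on large dataframes.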

Upvotes: 2

Hussein Awala

Reputation: 5096

I think you can use the monotonically_increasing_id function, which starts from 0; you can start from a custom offset by adding a constant value to each generated id:

offset = start_offset + monotonically_increasing_id()
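
A minimal runnable sketch of that, assuming start_offset = 500 as in the question (the new_id column name is hypothetical):

from pyspark.sql.functions import monotonically_increasing_id

start_offset = 500
df = df.withColumn("new_id", start_offset + monotonically_increasing_id())

Note that these ids are guaranteed to be unique and increasing but not consecutive, so the values start at start_offset but may jump across partitions; the row_number() approach in the other answer is needed for strictly sequential values.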

Upvotes: 0
