Grevioos

Reputation: 405

Pyspark: auto-increment starting from specific value

I need to add a column to my dataframe that increments by 1, but starting from 500. So the first row would be 500, the second 501, etc. It doesn't make sense to use a UDF, since it can be executed on different workers, and I don't know of any function that takes a starting value as a parameter. I don't have anything I could sort my dataframe on either. Both row number and auto increment start at 1 by default. I believe I can do it by transforming my df to an RDD and back to a df, but that seems like quite an ugly solution. Do you know of any existing function that would help me solve this on the dataframe level?
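
Roughly, the RDD version I would like to avoid looks something like this, using zipWithIndex (the new_id column name and the spark session variable are just for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# zipWithIndex assigns consecutive 0-based indices, so shift them by 500
with_idx = df.rdd.zipWithIndex().map(lambda pair: tuple(pair[0]) + (pair[1] + 500,))
df = spark.createDataFrame(with_idx, df.columns + ["new_id"])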

Thank you!

Upvotes: 3

Views: 5421

Answers (2)

Cena

Reputation: 3419

Since monotonically_increasing_id() isn't consecutive, you can use row_number() ordered by monotonically_increasing_id() and add 499.

from pyspark.sql.window import Window
from pyspark.sql.functions import monotonically_increasing_id, row_number

# unique, increasing (but not consecutive) id per row
df = df.withColumn("idx", monotonically_increasing_id())
# a global ordering on idx makes row_number() consecutive, starting at 1
w = Window().orderBy("idx")
df.withColumn("row_num", (499 + row_number().over(w))).show()
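
A quick usage sketch, assuming a local SparkSession and a toy dataframe (the letter column is just for illustration):

from pyspark.sql import SparkSession
from pyspark.sql.functions import monotonically_increasing_id, row_number
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a",), ("b",), ("c",)], ["letter"])
df = df.withColumn("idx", monotonically_increasing_id())
w = Window().orderBy("idx")
# row_num comes out as 500, 501, 502 in the original row order
df.withColumn("row_num", 499 + row_number().over(w)).show()

One design note: a Window with orderBy but no partitionBy moves all rows to a single partition, so Spark logs a performance warning on large dataframes.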

Upvotes: 2

Hussein Awala

Reputation: 5096

I think you can use the monotonically_increasing_id function, which starts from 0; you can start from a custom offset by adding a constant value to each generated id:

offset = start_offset + monotonically_increasing_id()
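
A minimal runnable sketch of that, assuming start_offset = 500 as in the question (the new_id column name is hypothetical):

from pyspark.sql.functions import monotonically_increasing_id

start_offset = 500
df = df.withColumn("new_id", start_offset + monotonically_increasing_id())

Note that these ids are guaranteed to be unique and increasing but not consecutive, so the values start at start_offset but may jump across partitions; the row_number() approach in the other answer is needed for strictly sequential values.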

Upvotes: 0
