Reputation: 405
I need to add a column to my dataframe that increments by 1, but starting from 500. So the first row would be 500, the second 501, etc. It doesn't make sense to use a UDF, since it can be executed on different workers, and I don't know of any function that takes a starting value as a parameter. I don't have anything I could sort my dataframe on either. Both row number and auto increment start at 1 by default. I believe I could do it by transforming my df to an rdd and back to a df, but that seems like quite an ugly solution. Do you know of any existing function that would help me solve this at the dataframe level?
Thank you!
Upvotes: 3
Views: 5421
Reputation: 3419
Since monotonically_increasing_id() isn't consecutive, you can use row_number() over a window ordered by monotonically_increasing_id() and add 499.
from pyspark.sql.functions import monotonically_increasing_id, row_number
from pyspark.sql.window import Window

# Tag each row with a monotonically increasing (but not consecutive) id,
# then use it as the window ordering for a consecutive row_number().
df = df.withColumn("idx", monotonically_increasing_id())
w = Window.orderBy("idx")
df.withColumn("row_num", 499 + row_number().over(w)).show()
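One caveat: a window without a partitionBy clause moves all rows to a single partition (Spark logs a performance warning for exactly this), so this works fine for small-to-medium dataframes but can become a bottleneck at scale. You may also want to drop the helper "idx" column afterwards.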
Upvotes: 2
Reputation: 5096
I think you can use the monotonically_increasing_id function, which starts from 0; you can start from a custom offset by adding a constant value to each generated id:
offset = start_offset + monotonically_increasing_id()
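A minimal runnable sketch of that approach (assuming df is an existing DataFrame; start_offset and the "id" column name are illustrative). Keep in mind that monotonically_increasing_id() is only guaranteed to be increasing and unique, not consecutive, so the resulting ids start at the offset but can have gaps across partitions:

from pyspark.sql.functions import lit, monotonically_increasing_id

start_offset = 500  # the desired starting value
# ids begin at start_offset but may skip values between partitions
df = df.withColumn("id", lit(start_offset) + monotonically_increasing_id())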
Upvotes: 0