Haha

Reputation: 1019

Add a row_number column to a partitioned Spark dataframe

I am trying to add a column containing the row number to a partitioned dataframe.

Initially, I read my delta data from Azure blob:

var df = spark.read.format("delta").load(path)

This data is partitioned on a date column:

df.rdd.getNumPartitions
res28: Int = 5

So when I try to add a row_num column:

df = df.withColumn("id", monotonically_increasing_id())

It generates 5 different sequences (one per partition), which is obviously not what I need.

My question is: is there any way to generate a proper row number column on a partitioned dataframe?

I am thinking about using something like this:

df = df.coalesce(1).withColumn("id", monotonically_increasing_id())

But I don't know whether this is safe for the rest of my code, or whether it is best practice.
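One alternative to coalesce(1) that I've seen is RDD.zipWithIndex, which assigns consecutive 0-based indices across all partitions without collapsing the dataframe to a single partition; it runs one small extra job to count rows per partition and then hands out contiguous index ranges. A minimal sketch (the toy data, column name, and local SparkSession below are assumptions for illustration, not from the original post):

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

val spark = SparkSession.builder().master("local[*]").appName("rowNum").getOrCreate()
import spark.implicits._

// Toy stand-in for the delta dataframe, spread over several partitions
val df = Seq("a", "b", "c", "d").toDF("value").repartition(3)

// zipWithIndex pairs every row with a consecutive global 0-based index;
// rebuild a dataframe with the index appended as a new "id" column
val withId = spark.createDataFrame(
  df.rdd.zipWithIndex.map { case (row, idx) => Row.fromSeq(row.toSeq :+ idx) },
  StructType(df.schema.fields :+ StructField("id", LongType, nullable = false))
)
```

The indices stay consecutive regardless of how many partitions the dataframe has, though note that zipWithIndex drops back to the RDD API, so Catalyst optimizations do not apply across that boundary.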

Thank you!

Upvotes: 0

Views: 768

Answers (1)

koiralo

Reputation: 23119

You can use a window function with row_number as below. Note that row_number requires an ordered window, so add an orderBy on whatever column defines the order you need:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

val window = Window.partitionBy("date").orderBy("date")

df.withColumn("id", row_number().over(window))
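A quick sketch of the two variants (the sample rows and local SparkSession are made up for illustration): partitioning the window by "date" restarts the numbering within every date, while a window with only an orderBy produces one global sequence, at the cost of Spark pulling all rows into a single partition for the sort:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

val spark = SparkSession.builder().master("local[*]").appName("windowRowNum").getOrCreate()
import spark.implicits._

val df = Seq(("2021-01-01", "a"), ("2021-01-01", "b"), ("2021-01-02", "c"))
  .toDF("date", "value")

// Restarts at 1 within each date
val perDate = df.withColumn("id",
  row_number().over(Window.partitionBy("date").orderBy("value")))

// One sequence 1..n over the whole dataframe (single-partition sort)
val global = df.withColumn("id",
  row_number().over(Window.orderBy("date", "value")))
```

Spark logs a warning on the global variant ("No Partition Defined for Window operation") precisely because of that single-partition shuffle, so it is only reasonable for data small enough to sort on one executor.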

Upvotes: 1
