Reputation: 1019
I am trying to add a column containing the row number to a partitioned dataframe.
Initially, I read my delta data from Azure blob:
var df = spark.read.format("delta").load(path)
This data is partitioned on a date column:
df.rdd.getNumPartitions
res28: Int = 5
So when I try to add a row_num column:
df = df.withColumn("id", monotonically_increasing_id())
It generates a separate id range per partition (5 in total), so the ids are not consecutive, which is obviously not what I need.
My question is: is there any way to generate a proper row number column on a partitioned dataframe?
I am thinking about using something like this:
df = df.coalesce(1).withColumn("id", monotonically_increasing_id())
But I don't know whether it is safe for the rest of my code to do this, nor whether it is best practice.
Thank you!
Upvotes: 0
Views: 768
Reputation: 23119
You can use a window function with row_number as below:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

// row_number requires an ordering; order by whichever column defines the row order within each date
val window = Window.partitionBy("date").orderBy("date")
df.withColumn("id", row_number().over(window))
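Note that partitioning the window by "date" restarts the numbering within each date value. If you need a single consecutive sequence across the whole dataframe, a window without partitionBy can be used instead. As a rough sketch (ordering by the existing date column here; ties within the same date get an arbitrary order), keep in mind that an unpartitioned window pushes all rows through a single partition, much like the coalesce(1) idea in the question:

// single global sequence; all rows are processed in one partition for this step
val globalWindow = Window.orderBy("date")
val dfWithId = df.withColumn("id", row_number().over(globalWindow))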
Upvotes: 1