Reputation: 365
I need to add a column to a Spark DataFrame that holds a duplicated sequence number, such as [1, 1, 1, 2, 2, 2, 3, 3, 3, ..., 10000, 10000, 10000].
I know that we can use monotonically_increasing_id
to get a sequence number as a new column.
val df_new = df.withColumn("id", monotonically_increasing_id)
Then, what is the solution to extend this to get the duplicated sequence number? Thanks!
Upvotes: 0
Views: 490
Reputation: 42332
You can calculate a row number, subtract 1, divide by 3, cast to integer type, and add 1 (subtracting 1 first is what keeps every group of three rows on the same id, since row_number starts at 1):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{monotonically_increasing_id, row_number}

val df_new = df.withColumn(
  "id",
  ((row_number().over(Window.orderBy(monotonically_increasing_id)) - 1) / 3).cast("int") + 1
)
Upvotes: 2