yyuankm

Reputation: 365

How to add a column with a repeated sequence number to a Spark DataFrame in Scala?

I need to add a column to a Spark DataFrame that holds a repeated sequence number, such as [1, 1, 1, 2, 2, 2, 3, 3, 3, ..., 10000, 10000, 10000]. I know that we can use monotonically_increasing_id to get a sequence number as a new column:

val df_new = df.withColumn("id", monotonically_increasing_id())

How can I extend this approach so that each sequence number is repeated three times? Thanks!

Upvotes: 0

Views: 490

Answers (1)

mck

Reputation: 42332

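You can calculate a 1-based row number, subtract 1, divide that by 3, cast to integer type, and add 1:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{monotonically_increasing_id, row_number}

// Subtract 1 first so that rows 1-3 map to 1, rows 4-6 map to 2, and so on.
// Note: a window without partitionBy moves all rows to a single partition.
val df_new = df.withColumn(
    "id",
    ((row_number().over(Window.orderBy(monotonically_increasing_id())) - 1) / 3).cast("int") + 1
)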
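
The grouping arithmetic can be sanity-checked in plain Scala, with no Spark session needed. The sketch below assumes 1-based row numbers, which is what row_number() produces, and a group size of 3. The names GroupIdCheck, groupId and groupSize are hypothetical and only for illustration. It compares the result with and without subtracting 1 before the division:

```scala
object GroupIdCheck {
  // Hypothetical helper: maps a 1-based row number to a 1-based group id.
  // Integer division truncates, which plays the role of cast("int") in Spark.
  def groupId(rowNumber: Int, groupSize: Int = 3): Int =
    (rowNumber - 1) / groupSize + 1

  def main(args: Array[String]): Unit = {
    val rowNumbers = 1 to 9

    // Subtracting 1 first gives three rows per group.
    val withOffset = rowNumbers.map(n => groupId(n))

    // Dividing the raw row number leaves the first group one row short.
    val withoutOffset = rowNumbers.map(n => n / 3 + 1)

    println(withOffset.mkString(", "))    // 1, 1, 1, 2, 2, 2, 3, 3, 3
    println(withoutOffset.mkString(", ")) // 1, 1, 2, 2, 2, 3, 3, 3, 4
  }
}
```

Changing groupSize gives a different number of repetitions per sequence number. The same offset applies to the Spark column expression, since row_number() also starts at 1.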

Upvotes: 2
