pipal

Reputation: 113

How to create a column of row id in Spark dataframe for each distinct column value using Scala

I have a data frame in Scala Spark as follows:

category | score
A | 0.2
A | 0.3
A | 0.3
B | 0.9
B | 0.8
B | 1

I would like to add a row id column as

category | score | row-id
A | 0.2 | 0
A | 0.3 | 1
A | 0.3 | 2
B | 0.9 | 0
B | 0.8 | 1
B | 1 | 2

Basically, I want the row id to increase monotonically for each distinct value in the category column. My dataframe is already sorted, so all rows with the same category are grouped together. However, I still don't know how to generate a row_id that restarts when a new category appears. Please help!

Upvotes: 1

Views: 1070

Answers (1)

Jon Deaton

Reputation: 4379

This is a good use case for window aggregation functions:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
import df.sparkSession.implicits._ // enables the 'symbol column syntax

val window = Window.partitionBy('category).orderBy('score)
// row_number starts at 1, so subtract 1 to get the 0-based ids shown above
df.withColumn("row-id", row_number.over(window) - 1)

Window functions work somewhat like groupBy, except that instead of each group returning a single value, each row in each group returns a single value. In this case the value is the row's position within the group of rows of the same category. Also, if this is the effect you are trying to achieve, then you don't need to have pre-sorted the column category beforehand.
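To see these semantics without a running Spark cluster, here is a minimal plain-Scala sketch (no Spark dependency; the object name is illustrative) that computes the same per-category, score-ordered row numbers over an in-memory collection. Note that `zipWithIndex` is 0-based, matching the desired output, whereas Spark's `row_number` starts at 1.

```scala
object WindowSketch extends App {
  // (category, score) rows, mirroring the example data frame
  val rows = Seq(
    ("A", 0.2), ("A", 0.3), ("A", 0.3),
    ("B", 0.9), ("B", 0.8), ("B", 1.0)
  )

  // Like Window.partitionBy('category).orderBy('score) + row_number:
  // group rows by category, sort each group by score, and assign
  // each row its 0-based position within its group.
  val withRowId: Seq[(String, Double, Int)] =
    rows.groupBy(_._1).toSeq.sortBy(_._1).flatMap { case (_, group) =>
      group.sortBy(_._2).zipWithIndex.map { case ((c, s), i) => (c, s, i) }
    }

  withRowId.foreach(println)
}
```

Unlike the Spark version, this collects everything on one machine, so it is only a mental model for what the window computes, not a substitute for it.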

Upvotes: 1
