Venkata Krishnan

Reputation: 51

Generate a row number that repeats N times in a PySpark DataFrame

I want to add a new column to a PySpark DataFrame in which each row number repeats N times, irrespective of the other columns in the DataFrame.

Original data:

name year 
A   2010
A   2011
A   2011
A   2013
A   2014
A   2015
A   2016
A   2018
B   2018
B   2019

I want a new column in which each row number repeats N times; take N=3 as an example.

Expected Output:

name year  rownumber
A   2010   1
A   2011   1
A   2011   1
A   2013   2
A   2014   2
A   2015   2
A   2016   3
A   2018   3
B   2018   3
B   2019   4

Upvotes: 3

Views: 926

Answers (1)

anky

Reputation: 75150

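You can use row_number() with division: number the rows from 1, subtract 1, divide by n, truncate to an integer, and add 1. Note that ordering the window by a constant (F.lit(0)) does not guarantee any particular row order and moves all rows into a single partition, so order by real columns if the order matters. For the sample data: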

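from pyspark.sql import Window
from pyspark.sql import functions as F
n = 3  # number of consecutive rows that share the same number
# row_number() is 1-based: subtract 1, divide by n, truncate, then add 1
df.withColumn("rounum",
   ((F.row_number().over(Window.orderBy(F.lit(0)))-1)/n).cast("Integer")+1).show()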

+----+----+------+
|name|year|rounum|
+----+----+------+
|   A|2010|     1|
|   A|2011|     1|
|   A|2011|     1|
|   A|2013|     2|
|   A|2014|     2|
|   A|2015|     2|
|   A|2016|     3|
|   A|2018|     3|
|   B|2018|     3|
|   B|2019|     4|
+----+----+------+
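
The arithmetic behind the expression can be checked without Spark. Here is a minimal sketch in plain Python, assuming 1-based row numbers like the ones row_number() produces. The name block_number is a hypothetical helper for illustration, not part of PySpark:

```python
def block_number(row_number: int, n: int) -> int:
    """Map a 1-based row number to a 1-based block of n consecutive rows."""
    if n <= 0:
        raise ValueError("n must be positive")
    # Same steps as the Spark expression: shift to 0-based,
    # integer-divide by n, then shift back to 1-based.
    return (row_number - 1) // n + 1


# Ten rows with n = 3, matching the ten rows of the sample DataFrame.
blocks = [block_number(i, 3) for i in range(1, 11)]
print(blocks)
```

For non-negative values, Python's floor division gives the same result as dividing and casting to an integer in the Spark expression above.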

Upvotes: 3
