Reputation: 51
I want to create a new column in a PySpark DataFrame that holds row numbers repeated N times each, irrespective of the other columns in the DataFrame.
Original data:
name year
A 2010
A 2011
A 2011
A 2013
A 2014
A 2015
A 2016
A 2018
B 2018
B 2019
I want a new column in which each row number repeats N times. For example, with N=3:
Expected Output:
name year rownumber
A 2010 1
A 2011 1
A 2011 1
A 2013 2
A 2014 2
A 2015 2
A 2016 3
A 2018 3
B 2018 3
B 2019 4
Upvotes: 3
Views: 926
Reputation: 75150
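You can use row_number with integer division. Ordering the window by a constant does not guarantee any particular row order, and a window without partitioning moves all rows to a single partition, so order by a real column if the sequence matters: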
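from pyspark.sql import Window, functions as F

n = 3  # rows per group
# 0-based row index, integer-divided by n, then shifted back to 1-based
df.withColumn("rounum",
    ((F.row_number().over(Window.orderBy(F.lit(0))) - 1) / n).cast("integer") + 1).show()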
+----+----+------+
|name|year|rounum|
+----+----+------+
|   A|2010|     1|
|   A|2011|     1|
|   A|2011|     1|
|   A|2013|     2|
|   A|2014|     2|
|   A|2015|     2|
|   A|2016|     3|
|   A|2018|     3|
|   B|2018|     3|
|   B|2019|     4|
+----+----+------+
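The arithmetic behind this can be checked without Spark. The following is only a sketch in plain Python: the function name repeating_row_numbers is made up for illustration and is not part of PySpark. It applies the same formula, (row_number - 1) / n truncated to an integer, plus 1, to the 1-based row numbers that row_number() would produce.

```python
def repeating_row_numbers(row_count, n):
    """Return one group number per row, with n consecutive rows sharing a number.

    Hypothetical helper for illustration only; it mirrors the Spark expression
    ((row_number - 1) / n).cast("integer") + 1 using integer division.
    """
    # row_number() is 1-based, so subtract 1 before dividing,
    # then add 1 so that the group numbers also start at 1.
    return [(row_number - 1) // n + 1 for row_number in range(1, row_count + 1)]


if __name__ == "__main__":
    # Ten rows with n=3, matching the size of the example data.
    print(repeating_row_numbers(10, 3))  # [1, 1, 1, 2, 2, 2, 3, 3, 3, 4]
```

Subtracting 1 first is what makes rows 1 to 3 land in group 1; dividing the raw 1-based row number by n would put row 3 into group 2.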
Upvotes: 3