Reputation: 43
The question is in the title, but: how do you create a new column in PySpark that cumulatively counts the number of consecutive repeated values?
For instance:
| Value |
|-------|
| 0     |
| 0     |
| 5     |
| 5     |
| -1    |
| 0     |
| 0     |
| 0     |
Applying this to the Value column would result in a new column like this:
| Value | Result |
|-------|--------|
| 0     | 1      |
| 0     | 2      |
| 5     | 1      |
| 5     | 2      |
| -1    | 1      |
| 0     | 1      |
| 0     | 2      |
| 0     | 3      |
Upvotes: 2
Views: 1062
Reputation: 32670
One solution is to use the difference between two row numbers to create a group for each run of consecutive equal values, and then compute row_number() within each group as the result:
from pyspark.sql import functions as F, Window

df = spark.createDataFrame([(0,), (0,), (5,), (5,), (-1,), (0,), (0,), (0,)], ["Value"])

df.withColumn("ID", F.monotonically_increasing_id()) \
  .withColumn("group",
              # rows in the same run of consecutive values get the same constant difference
              F.row_number().over(Window.orderBy("ID"))
              - F.row_number().over(Window.partitionBy("Value").orderBy("ID"))
              ) \
  .withColumn("Result", F.row_number().over(Window.partitionBy("group").orderBy("ID"))) \
  .drop("ID", "group") \
  .show()
#+-----+------+
#|Value|Result|
#+-----+------+
#| 0| 1|
#| 0| 2|
#| 5| 1|
#| 5| 2|
#| 0| 1|
#| 0| 2|
#| 0| 3|
#| -1| 1|
#+-----+------+
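To see why the group trick works: assuming the generated IDs preserve the input order, the global row number runs 1 through 8, while the per-value row number is 1, 2 for the first run of 0s, 1, 2 for the 5s, 1 for -1, and 3, 4, 5 for the last run of 0s. The difference (0, 0, 2, 2, 4, 3, 3, 3) is therefore constant within each run and distinct across runs, so partitioning by it isolates each run.

A closely related variant, not part of the answer above and only a sketch (the change and grp column names are illustrative), flags rows where the value differs from the previous one with lag and takes a running sum of that flag to build the same per-run groups:

from pyspark.sql import functions as F, Window

df = spark.createDataFrame([(0,), (0,), (5,), (5,), (-1,), (0,), (0,), (0,)], ["Value"])

w = Window.orderBy("ID")  # unpartitioned window: fine for small data

result = (
    df.withColumn("ID", F.monotonically_increasing_id())
      # 1 whenever a new run starts; lag() is null on the first row, so it also gets 1
      .withColumn("change", F.when(F.lag("Value").over(w) == F.col("Value"), 0).otherwise(1))
      # running sum of the change flags gives one constant id per run
      .withColumn("grp", F.sum("change").over(w))
      .withColumn("Result", F.row_number().over(Window.partitionBy("grp").orderBy("ID")))
      .drop("ID", "change", "grp")
)
result.show()

Both versions rely on an unpartitioned ordered window, so Spark will warn about moving all data to a single partition; that is acceptable for data of this size.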
Upvotes: 2