Reputation: 43
The question is in the title, but: how do you create a new column in PySpark that cumulatively counts the number of consecutive repeated values?
For instance:
| Value |
|-------|
| 0     |
| 0     |
| 5     |
| 5     |
| -1    |
| 0     |
| 0     |
| 0     |
Applying this to the Value column would result in a new column like this:
| Value | Result |
|-------|--------|
| 0     | 1      |
| 0     | 2      |
| 5     | 1      |
| 5     | 2      |
| -1    | 1      |
| 0     | 1      |
| 0     | 2      |
| 0     | 3      |
Upvotes: 2
Views: 1062
Reputation: 32670
One solution is to use the difference between two row numbers to create a group for each run of consecutive equal values, and then compute row_number() within each group as the result:
from pyspark.sql import functions as F, Window

df = spark.createDataFrame([(0,), (0,), (5,), (5,), (-1,), (0,), (0,), (0,)], ["Value"])

df.withColumn("ID", F.monotonically_increasing_id()) \
  .withColumn("group",
              # rows in the same run of consecutive values get the same constant difference
              F.row_number().over(Window.orderBy("ID"))
              - F.row_number().over(Window.partitionBy("Value").orderBy("ID"))
              ) \
  .withColumn("Result", F.row_number().over(Window.partitionBy("group").orderBy("ID"))) \
  .drop("ID", "group") \
  .show()
#+-----+------+
#|Value|Result|
#+-----+------+
#| 0| 1|
#| 0| 2|
#| 5| 1|
#| 5| 2|
#| 0| 1|
#| 0| 2|
#| 0| 3|
#| -1| 1|
#+-----+------+
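To see why the group trick works: assuming the generated IDs preserve the input order, the global row number runs 1 through 8, while the per-value row number is 1, 2 for the first run of 0s, 1, 2 for the 5s, 1 for -1, and 3, 4, 5 for the last run of 0s. The difference (0, 0, 2, 2, 4, 3, 3, 3) is therefore constant within each run and distinct across runs, so partitioning by it isolates each run.

A closely related variant, not part of the answer above and only a sketch (the change and grp column names are illustrative), flags rows where the value differs from the previous one with lag and takes a running sum of that flag to build the same per-run groups:

from pyspark.sql import functions as F, Window

df = spark.createDataFrame([(0,), (0,), (5,), (5,), (-1,), (0,), (0,), (0,)], ["Value"])

w = Window.orderBy("ID")  # unpartitioned window: fine for small data

result = (
    df.withColumn("ID", F.monotonically_increasing_id())
      # 1 whenever a new run starts; lag() is null on the first row, so it also gets 1
      .withColumn("change", F.when(F.lag("Value").over(w) == F.col("Value"), 0).otherwise(1))
      # running sum of the change flags gives one constant id per run
      .withColumn("grp", F.sum("change").over(w))
      .withColumn("Result", F.row_number().over(Window.partitionBy("grp").orderBy("ID")))
      .drop("ID", "change", "grp")
)
result.show()

Both versions rely on an unpartitioned ordered window, so Spark will warn about moving all data to a single partition; that is acceptable for data of this size.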
Upvotes: 2