babz

Reputation: 479

Pyspark counter field, groupby and increment by 1

My dataframe is as follows:

cola, colb
1, 2
1, 3
2, 1
2, 5

I want to add a count column:

cola, colb, count
1,     2,    1
1,     3,    2
2,     1,    1
2,     5,    2

I need to group by cola and set count to 1 for the first record in each group, then increment each successive row by 1.

I tried to use a window function, but it applies the same count to every record in a cola group instead of incrementing. Something along these lines reproduces the problem (a sketch, not my exact code):
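
import pyspark.sql.functions as F
from pyspark.sql import Window

# Window with a partition but no ordering: count() is evaluated
# over the entire partition, so every row gets the group total
w = Window.partitionBy('cola')
df = df.withColumn('count', F.count('*').over(w))
# With the data above, every row ends up with count 2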

Upvotes: 1

Views: 1038

Answers (1)

cronoik

Reputation: 19550

You already figured out that a window function is the way to go. Perhaps you haven't tried the rank function yet.

import pyspark.sql.functions as F
from pyspark.sql import Window

l = [(1, 2),
     (1, 3),
     (1, 2),  # duplicate of the first row, included to show how rank handles ties
     (2, 1),
     (2, 5)]

columns = ['cola', 'colb']

df = spark.createDataFrame(l, columns)

w = Window.partitionBy('cola').orderBy('colb')

df = df.withColumn('count', F.rank().over(w))
df.show()

Output:

+----+----+-----+
|cola|colb|count|
+----+----+-----+
|   1|   2|    1|
|   1|   2|    1|
|   1|   3|    3|
|   2|   1|    1|
|   2|   5|    2|
+----+----+-----+

In case you don't want gaps after equal rows (the jump from 1 to 3 above), use the dense_rank function instead. As a quick sketch on the same df and window as above:
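
# dense_rank assigns consecutive ranks, with no gap after ties
df = df.withColumn('count', F.dense_rank().over(w))
df.show()

With the data above, the count column for cola == 1 becomes 1, 1, 2 instead of 1, 1, 3; the cola == 2 rows are unchanged.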

Upvotes: 4
