Reputation: 479
My dataframe is as follows:
cola, colb
1, 2
1, 3
2, 1
2, 5
I want to add a count column:
cola, colb, count
1, 2, 1
1, 3, 2
2, 1, 1
2, 5, 2
I need to group by cola, set count to 1 for the first record in each group, and then increment it by 1 for each successive row.
I tried a window function, but it applies the same count to every record in a cola group instead of incrementing it.
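For reference, a minimal sketch of the kind of aggregate window that behaves this way (a hypothetical reconstruction of the attempt, not the asker's actual code):

import pyspark.sql.functions as F
from pyspark.sql import SparkSession, Window

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 2), (1, 3), (2, 1), (2, 5)], ['cola', 'colb'])

# An aggregate over a partition-only window returns one value per group,
# so every row in a cola group gets that group's total row count.
w = Window.partitionBy('cola')
df.withColumn('count', F.count('colb').over(w)).show()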
Upvotes: 1
Views: 1038
Reputation: 19550
You already figured out that a window function is the way to go; perhaps you just haven't come across the rank function yet.
import pyspark.sql.functions as F
from pyspark.sql import SparkSession, Window

spark = SparkSession.builder.getOrCreate()

# Note the duplicate row (1, 2): it illustrates how rank() treats ties.
l = [(1, 2),
     (1, 3),
     (1, 2),
     (2, 1),
     (2, 5)]
columns = ['cola', 'colb']
df = spark.createDataFrame(l, columns)

# Rank rows within each cola partition, ordered by colb.
w = Window.partitionBy('cola').orderBy('colb')
df = df.withColumn('count', F.rank().over(w))
df.show()
Output:
+----+----+-----+
|cola|colb|count|
+----+----+-----+
| 1| 2| 1|
| 1| 2| 1|
| 1| 3| 3|
| 2| 1| 1|
| 2| 5| 2|
+----+----+-----+
If you don't want gaps after tied rows (note the jump from 1 to 3 in group 1 above), use the dense_rank function instead.
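As a minimal sketch reusing the df and w defined above; row_number is included as well, since it produces the strictly increasing per-group counter the question describes even when colb contains duplicates:

# dense_rank: tied rows share a value, but the next distinct value
# continues without a gap, so group 1 gets counts 1, 1, 2.
df.withColumn('count', F.dense_rank().over(w)).show()

# row_number: a strictly increasing counter within each partition,
# regardless of ties, so group 1 gets counts 1, 2, 3.
df.withColumn('count', F.row_number().over(w)).show()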
Upvotes: 4