Reputation: 121
I'm trying to create an index in a DataFrame with PySpark, using a window and the row_number function.
For example:
Original dataframe
Note: the data are random
Coldata |
---|
A |
B |
C |
D |
E |
F |
G |
H |
I |
Expected Dataframe:
Coldata | index |
---|---|
A | 1 |
B | 1 |
C | 1 |
D | 2 |
E | 2 |
F | 2 |
G | 3 |
H | 3 |
I | 3 |
My code at the moment is:
from pyspark.sql import Window
from pyspark.sql.functions import row_number

w = Window.orderBy("Coldata")
df_expected = df.withColumn("index", row_number().over(w))
But this returns 1, 2, 3, 4, 5, ... (a different index for every row) instead of the same index repeated every 3 rows.
Upvotes: 0
Views: 346
Reputation: 42352
You can calculate (row_number + 2) / 3 and cast the result to an integer; every group of 3 consecutive rows then gets the same value:
from pyspark.sql import functions as F, Window

df2 = df.withColumn(
    'index',
    ((F.row_number().over(Window.orderBy('Coldata')) + 2) / 3).cast('int')
)
df2.show()
+-------+-----+
|Coldata|index|
+-------+-----+
| A| 1|
| B| 1|
| C| 1|
| D| 2|
| E| 2|
| F| 2|
| G| 3|
| H| 3|
| I| 3|
+-------+-----+
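If you'd rather avoid the division-then-cast, the same 3-row bucketing can be expressed with floor division on the zero-based row number. This is just a sketch of an equivalent formulation; the group size of 3 comes from your example, and df3 is only an illustrative name:

from pyspark.sql import functions as F, Window

group_size = 3  # rows per index value, taken from the example data

# Zero-based row number floor-divided by the group size, then shifted to start at 1
df3 = df.withColumn(
    'index',
    F.floor((F.row_number().over(Window.orderBy('Coldata')) - 1) / group_size) + 1
)

This produces the same 1, 1, 1, 2, 2, 2, 3, 3, 3 grouping and makes the group size an explicit parameter.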
Upvotes: 1