Thiago Bueno

Reputation: 121

Create a kind of index in PySpark with Window and row_number

I'm trying to create an index in a DataFrame with PySpark, using a Window and the row_number function.

For example:

Original DataFrame

Note: the data are random

Coldata
A
B
C
D
E
F
G
H
I

Expected DataFrame:

Coldata index
A 1
B 1
C 1
D 2
E 2
F 2
G 3
H 3
I 3

My code at the moment is:

from pyspark.sql import Window
from pyspark.sql.functions import row_number
w = Window.orderBy("Coldata")
df_expected = df.withColumn("index", row_number().over(w))

But this returns 1, 2, 3, 4, 5, … (a new value for every row) instead of the grouped index shown above.

Upvotes: 0

Views: 346

Answers (1)

mck

Reputation: 42352

You can calculate (row_number + 2) / 3 and cast to integer:

from pyspark.sql import functions as F, Window

df2 = df.withColumn(
    'index',
    ((F.row_number().over(Window.orderBy('Coldata')) + 2) / 3).cast('int')
)

df2.show()
+-------+-----+
|Coldata|index|
+-------+-----+
|      A|    1|
|      B|    1|
|      C|    1|
|      D|    2|
|      E|    2|
|      F|    2|
|      G|    3|
|      H|    3|
|      I|    3|
+-------+-----+
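
Equivalently, the same bucketing into groups of three can be written with floor division instead of adding 2 and casting. This is just an alternative sketch, assuming the same df, the Coldata column, and a group size of 3; df3 is an arbitrary name used here for illustration:

from pyspark.sql import functions as F, Window

# Zero-based position of each row, integer-divided by the group size,
# then shifted back to a 1-based group index.
df3 = df.withColumn(
    'index',
    F.floor((F.row_number().over(Window.orderBy('Coldata')) - 1) / 3) + 1
)

Both versions produce the same 1, 1, 1, 2, 2, 2, 3, 3, 3 grouping shown above.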

Upvotes: 1
