Thiago Bueno

Reputation: 121

Create a kind of index in PySpark with Window and row_number

I'm trying to create an index in a DataFrame with PySpark, using a Window and the row_number function.

For example:

Original DataFrame

Note: the data are random

Coldata
A
B
C
D
E
F
G
H
I

Expected DataFrame:

Coldata index
A 1
B 1
C 1
D 2
E 2
F 2
G 3
H 3
I 3

My code at the moment is:

from pyspark.sql import Window
from pyspark.sql.functions import row_number
w = Window.orderBy("Coldata")
df_expected = df.withColumn("index", row_number().over(w))

But this returns 1, 2, 3, 4, 5, … (a new value for every row) instead of the grouped index shown above.

Upvotes: 0

Views: 346

Answers (1)

mck

Reputation: 42352

You can calculate (row_number + 2) / 3 and cast to integer:

from pyspark.sql import functions as F, Window

df2 = df.withColumn(
    'index',
    ((F.row_number().over(Window.orderBy('Coldata')) + 2) / 3).cast('int')
)

df2.show()
+-------+-----+
|Coldata|index|
+-------+-----+
|      A|    1|
|      B|    1|
|      C|    1|
|      D|    2|
|      E|    2|
|      F|    2|
|      G|    3|
|      H|    3|
|      I|    3|
+-------+-----+
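
Equivalently, the same bucketing into groups of three can be written with floor division instead of adding 2 and casting. This is just an alternative sketch, assuming the same df, the Coldata column, and a group size of 3; df3 is an arbitrary name used here for illustration:

from pyspark.sql import functions as F, Window

# Zero-based position of each row, integer-divided by the group size,
# then shifted back to a 1-based group index.
df3 = df.withColumn(
    'index',
    F.floor((F.row_number().over(Window.orderBy('Coldata')) - 1) / 3) + 1
)

Both versions produce the same 1, 1, 1, 2, 2, 2, 3, 3, 3 grouping shown above.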

Upvotes: 1
