add index following the value of timestamps pyspark

Question

I have a table with item_name, item_value and timestamps. I need to add one more column to the table which I will use as an index for the input for ml models. We have values with 5 min granularity of timestamp. this is the table I have currently

| item_name | item_value | timestamp |
|:---- |:------:| -----:|
|  A | 0.25    | 2023-03-01T17:20:00.000+0000 |
|  B | 0.34    | 2023-03-01T17:20:00.000+0000 |
|  A | 0.30    | 2023-03-01T17:25:00.000+0000 |
|  B | 0.54    | 2023-03-01T17:25:00.000+0000 |
|  A | 0.30    | 2023-03-01T17:30:00.000+0000 |
|  B | 0.54    | 2023-03-01T17:30:00.000+0000 |

and I want to add a column named index to the table like this

| item_name | item_value | timestamp | index |
|:---- |:------:| -----:| -----:|
|  A | 0.25    | 2023-03-01T17:20:00.000+0000 |0|
|  B | 0.34    | 2023-03-01T17:20:00.000+0000 |0|
|  A | 0.30    | 2023-03-01T17:25:00.000+0000 |1|
|  B | 0.54    | 2023-03-01T17:25:00.000+0000 |1|
|  A | 0.30    | 2023-03-01T17:30:00.000+0000 |2|
|  B | 0.54    | 2023-03-01T17:30:00.000+0000 |2|

Any help would be appreciated!

Tushar Patil · Accepted Answer

You can use Window functions with dense_rank to achieve it by doing something like this,

import pyspark.sql.functions as F
from pyspark.sql import Window

w = Window().orderBy(F.col('timestamp'))

df = df.withColumn('idx', (F.dense_rank().over(w) - 1))

df.show(truncate=False)

Output:

+---------+----------+----------------------------+---+
|item_name|item_value|timestamp                   |idx|
+---------+----------+----------------------------+---+
|A        |0.25      |2023-03-01T17:20:00.000+0000|0  |
|B        |0.34      |2023-03-01T17:20:00.000+0000|0  |
|A        |0.3       |2023-03-01T17:25:00.000+0000|1  |
|B        |0.54      |2023-03-01T17:25:00.000+0000|1  |
|A        |0.3       |2023-03-01T17:30:00.000+0000|2  |
|B        |0.54      |2023-03-01T17:30:00.000+0000|2  |
+---------+----------+----------------------------+---+

add index following the value of timestamps pyspark

Answers (1)

Related Questions