Reputation: 208
I have a table with item_name, item_value and timestamps. I need to add one more column to the table which I will use as an index for the input for ml models. We have values with 5 min granularity of timestamp. this is the table I have currently
| item_name | item_value | timestamp |
|:---- |:------:| -----:|
| A | 0.25 | 2023-03-01T17:20:00.000+0000 |
| B | 0.34 | 2023-03-01T17:20:00.000+0000 |
| A | 0.30 | 2023-03-01T17:25:00.000+0000 |
| B | 0.54 | 2023-03-01T17:25:00.000+0000 |
| A | 0.30 | 2023-03-01T17:30:00.000+0000 |
| B | 0.54 | 2023-03-01T17:30:00.000+0000 |
and I want to add a column named index to the table like this
| item_name | item_value | timestamp | index |
|:---- |:------:| -----:| -----:|
| A | 0.25 | 2023-03-01T17:20:00.000+0000 |0|
| B | 0.34 | 2023-03-01T17:20:00.000+0000 |0|
| A | 0.30 | 2023-03-01T17:25:00.000+0000 |1|
| B | 0.54 | 2023-03-01T17:25:00.000+0000 |1|
| A | 0.30 | 2023-03-01T17:30:00.000+0000 |2|
| B | 0.54 | 2023-03-01T17:30:00.000+0000 |2|
Any help would be appreciated!
Upvotes: 0
Views: 129
Reputation: 758
You can use Window functions with dense_rank to achieve it by doing something like this,
import pyspark.sql.functions as F
from pyspark.sql import Window
w = Window().orderBy(F.col('timestamp'))
df = df.withColumn('idx', (F.dense_rank().over(w) - 1))
df.show(truncate=False)
Output:
+---------+----------+----------------------------+---+
|item_name|item_value|timestamp |idx|
+---------+----------+----------------------------+---+
|A |0.25 |2023-03-01T17:20:00.000+0000|0 |
|B |0.34 |2023-03-01T17:20:00.000+0000|0 |
|A |0.3 |2023-03-01T17:25:00.000+0000|1 |
|B |0.54 |2023-03-01T17:25:00.000+0000|1 |
|A |0.3 |2023-03-01T17:30:00.000+0000|2 |
|B |0.54 |2023-03-01T17:30:00.000+0000|2 |
+---------+----------+----------------------------+---+
Upvotes: 1