MMV

Reputation: 208

How to truncate a timestamp to the hour and store the column value as a variable in PySpark

I want to store a column value from a table as a variable. The table looks like this:

+----------------------------+--------------------------------+------------------+
|timestamp                   |item_name                       |item_value        |
+----------------------------+--------------------------------+------------------+
|2023-03-01T17:35:00.000+0000|[asset_min, hpc_max, off_median]|[0.32, 0.67, 0.32]|
|2023-03-01T17:30:00.000+0000|[asset_min, hpc_max, off_median]|[0.54, 0.54, 0.3] |
|2023-03-01T17:20:00.000+0000|[asset_min, hpc_max, off_median]|[0.34, 0.25, 0.67]|
+----------------------------+--------------------------------+------------------+

I want to truncate the timestamp to the hour, 2023-03-01T17:00:00.000+0000, and store it as a variable.

The outcome I want is:

print(timestamp_hour)
'2023-03-01T17:00:00.000+0000'

Upvotes: 0

Views: 279

Answers (3)

ScootCork

Reputation: 3686

You can do this with the date_trunc function and then collect the result. I apply distinct() to reduce the data to one row before collecting (or to several rows, if the timestamp column spans multiple hours).

After collecting, we select the first column of the first row.

import pyspark.sql.functions as F

# Truncate each timestamp to the hour, deduplicate, then take the first column of the first row.
timestamp_hour = df.select(F.date_trunc('hour', 'timestamp')).distinct().collect()[0][0]
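One caveat: date_trunc yields a timestamp, so the collected value is typically a Python datetime.datetime rather than the string shown in the question. The sketch below shows one way to turn such a value into that string. The helper name format_hour is hypothetical, and treating a naive datetime as UTC is an assumption that may not match your Spark session time zone.

```python
from datetime import datetime, timezone


def format_hour(dt: datetime) -> str:
    """Format a datetime as e.g. '2023-03-01T17:00:00.000+0000'."""
    # Assumption: a datetime without tzinfo represents UTC.
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    # strftime has no millisecond directive, so build that part by hand.
    millis = dt.microsecond // 1000
    return dt.strftime("%Y-%m-%dT%H:%M:%S") + f".{millis:03d}" + dt.strftime("%z")


# Stand-in for the value returned by collect()[0][0].
collected = datetime(2023, 3, 1, 17, 0)
timestamp_hour = format_hour(collected)
print(timestamp_hour)
```

With the stand-in value above, this prints 2023-03-01T17:00:00.000+0000.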

Upvotes: 0

Tourelou

Reputation: 123

table = """
+----------------------------+--------------------------------+------------------+
|timestamp                   |item_name                       |item_value        |
+----------------------------+--------------------------------+------------------+
|2023-03-01T17:35:00.000+0000|[asset_min, hpc_max, off_median]|[0.32, 0.67, 0.32]|
|2023-03-01T17:30:00.000+0000|[asset_min, hpc_max, off_median]|[0.54, 0.54, 0.3] |
|2023-03-01T17:20:00.000+0000|[asset_min, hpc_max, off_median]|[0.34, 0.25, 0.67]|
+----------------------------+--------------------------------+------------------+
"""
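def timestamp_hour(t_stamp: str) -> str:
    # Keep 'YYYY-MM-DDTHH' (the first 13 characters) and zero out the rest.
    # Assumes the offset is always +0000, as in the sample data.
    return f'{t_stamp[:13]}:00:00.000+0000'

ltable = table.splitlines()
for line in ltable:
    # Only the data rows contain a date.
    if '2023-' in line:
        # The 28-character timestamp starts right after the leading '|'.
        print(timestamp_hour(line[1:29]))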

Output

2023-03-01T17:00:00.000+0000
2023-03-01T17:00:00.000+0000
2023-03-01T17:00:00.000+0000

Upvotes: 0

Tourelou

Reputation: 123

Maybe something like this:

ts = '2023-03-01T17:35:00.000+0000'
def timestamp_hour(t_stamp: str) -> str:
    return f'{t_stamp[:13]}:00:00.000+0000'

print(timestamp_hour(ts))

Output:

2023-03-01T17:00:00.000+0000
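The slicing approach hard-codes the +0000 offset. As an alternative sketch, the string can be parsed with the standard datetime module and truncated there, which keeps whatever offset the input carries and rejects malformed input. The names FMT and truncate_to_hour are hypothetical, and the format string assumes every timestamp looks like the samples in the question.

```python
from datetime import datetime

# Assumed layout of the input, e.g. '2023-03-01T17:35:00.000+0000'.
FMT = "%Y-%m-%dT%H:%M:%S.%f%z"


def truncate_to_hour(t_stamp: str) -> str:
    """Parse the timestamp, zero out everything below the hour, and re-format it."""
    parsed = datetime.strptime(t_stamp, FMT)  # raises ValueError on malformed input
    floored = parsed.replace(minute=0, second=0, microsecond=0)
    # Milliseconds are always zero after truncation, so '.000' is written literally.
    return floored.strftime("%Y-%m-%dT%H:%M:%S") + ".000" + floored.strftime("%z")


print(truncate_to_hour("2023-03-01T17:35:00.000+0000"))
```

For the sample timestamp this prints 2023-03-01T17:00:00.000+0000.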

Upvotes: -1
