Martin Tynan

Reputation: 53

Adding variable hours to timestamp in Spark SQL

I have one column Start_Time with a timestamp, and one column Time_Zone_Offset, an integer. How can I add the Time_Zone_Offset to Start_Time as a number of hours?

Example MyTable:

id  Start_Time            Time_Zone_Offset

1   2020-01-12 00:00:00     1
2   2020-01-12 00:00:00     2

Desired output:

id  Local_Start_Time

1   2020-01-12 01:00:00
2   2020-01-12 02:00:00

Attempt:

SELECT id, Start_time + INTERVAL time_zone_offset HOURS AS Local_Start_Time
FROM MyTable

This doesn't work: the INTERVAL syntax expects a literal number of hours, not a column. I also can't use from_utc_timestamp, as I don't have the actual time zone, only the offset in effect at the time being considered.

Upvotes: 3

Views: 13822

Answers (4)

iamdave

Reputation: 12243

For anyone else coming to this question and using Spark SQL via Databricks: the dateadd function works much as it does in most other SQL dialects:

select dateadd(microsecond,30,'2022-11-04') as microsecond
      ,dateadd(millisecond,30,'2022-11-04') as millisecond
      ,dateadd(second     ,30,'2022-11-04') as second
      ,dateadd(minute     ,30,'2022-11-04') as minute
      ,dateadd(hour       ,30,'2022-11-04') as hour
      ,dateadd(day        ,30,'2022-11-04') as day
      ,dateadd(week       ,30,'2022-11-04') as week
      ,dateadd(month      ,30,'2022-11-04') as month
      ,dateadd(quarter    ,30,'2022-11-04') as quarter
      ,dateadd(year       ,30,'2022-11-04') as year

Output

unit         result
microsecond  2022-11-04T00:00:00.000+0000
millisecond  2022-11-04T00:00:00.030+0000
second       2022-11-04T00:00:30.000+0000
minute       2022-11-04T00:30:00.000+0000
hour         2022-11-05T06:00:00.000+0000
day          2022-12-04T00:00:00.000+0000
week         2023-06-02T00:00:00.000+0000
month        2025-05-04T00:00:00.000+0000
quarter      2030-05-04T00:00:00.000+0000
year         2052-11-04T00:00:00.000+0000
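
Applied back to the question's table, the hour unit with a per-row offset would look something like this (a sketch, assuming a Databricks runtime where dateadd(unit, value, expr) accepts a column expression for value, and a SparkSession named spark):

# Sketch: dateadd with a per-row offset column (assumes Databricks SQL's
# dateadd(unit, value, expr) and a SparkSession named `spark`)
result = spark.sql("""
    SELECT id,
           dateadd(hour, Time_Zone_Offset, Start_Time) AS Local_Start_Time
    FROM MyTable
""")
result.show()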

Upvotes: 6

yxtay

Reputation: 546

Starting with Spark 3.0, you can use the make_interval(years, months, weeks, days, hours, mins, secs) function to add an interval built from values in other columns.

SELECT 
    id
    , Start_time + make_interval(0, 0, 0, 0, time_zone_offset, 0, 0) AS Local_Start_Time
FROM MyTable
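
The same expression works from the DataFrame API by wrapping it in expr (a sketch; the DataFrame name df and the output column name Local_Start_Time are assumptions):

from pyspark.sql import functions as F

# DataFrame-API equivalent of the SQL above (Spark 3.0+):
# build an interval with only the hours slot populated from the offset column
df_local = df.withColumn(
    "Local_Start_Time",
    F.expr("Start_Time + make_interval(0, 0, 0, 0, Time_Zone_Offset, 0, 0)"),
)
df_local.show()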

Upvotes: 3

Martin Tynan

Reputation: 53

Adding an alternative to Benoit's answer, using a Python UDF:

from datetime import timedelta
from pyspark.sql.types import TimestampType

# Define a Python function to add a variable number of hours onto a datetime
def addHours(my_datetime, hours):
    # Account for NULL (None in Python) values
    if (hours is None) or (my_datetime is None):
        return None
    return my_datetime + timedelta(hours=hours)

# Register the function as a UDF for use in SQL, declaring the return type as
# TimestampType (this is important: the default is StringType).
# `spark` is the SparkSession; on older versions use sqlContext.udf.register.
spark.udf.register("add_hours", addHours, TimestampType())

followed by:

SELECT id, add_hours(Start_Time, Time_Zone_Offset) AS Local_Start_Time
FROM MyTable
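
If you would rather stay in the DataFrame API, the same Python function can be wrapped with udf instead of being registered for SQL (a sketch; df is assumed to hold the question's table):

from pyspark.sql import functions as F
from pyspark.sql.types import TimestampType

# Wrap the same addHours function as a DataFrame-API UDF
add_hours_udf = F.udf(addHours, TimestampType())
df_local = df.withColumn("Local_Start_Time", add_hours_udf("Start_Time", "Time_Zone_Offset"))
df_local.show()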

Upvotes: 2

Benoit F

Reputation: 499

(I hope you are using PySpark.)

Indeed, I couldn't make it work in pure SQL, but I managed to get the result by converting to a Unix timestamp. It's probably not the best way, but it works. (I proceeded step by step to make sure the references were working; I thought I would need a user-defined function, but apparently not.)

from pyspark.sql import functions as F

# Step 1: convert the timestamp to Unix epoch seconds
df2 = df.withColumn("Start_Time", F.unix_timestamp("Start_Time"))
df2.show()

# Step 2: add the offset, converted from hours to seconds
df3 = df.withColumn("Start_Time", F.unix_timestamp("Start_Time") + df["Time_Zone_Offset"] * 60 * 60)
df3.show()

# Step 3: format back to a string; note the Java date pattern letters:
# yyyy-MM-dd, not YYYY-MM-DD (YYYY is the week-based year and DD the day of year)
df4 = df3.withColumn("Start_Time", F.from_unixtime("Start_Time", "yyyy-MM-dd HH:00:00"))
df4.show()
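
A slightly tighter variant of the same idea skips the string formatting and casts the shifted epoch seconds straight back to a timestamp (a sketch; the column name Local_Start_Time is just illustrative):

from pyspark.sql import functions as F

# Shift by the per-row offset (in seconds) and cast back to a timestamp
df_local = df.withColumn(
    "Local_Start_Time",
    (F.unix_timestamp("Start_Time") + F.col("Time_Zone_Offset") * 3600).cast("timestamp"),
)
df_local.show()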

Upvotes: 2
