Davoud Malekahmadi
Davoud Malekahmadi

Reputation: 63

How to use a column's name in an expr() instead of direct value for add date in Pyspark?

I want to add the value of a column (add_value) which is based on seconds, to a timestamp column (start_date).

+-----------------------+----------+
|       start_date      | add_value|
+-----------------------+----------+
|2022-09-05 00:00:01.394|  6.001   |
+-----------------------+----------+

Using the below code for direct value works fine.

(col('start_date') + expr('INTERVAL 6.001 seconds')).alias("end_date")

But replacing it with the column's name has an error.

(col('start_date') + expr('INTERVAL add_value seconds')).alias("end_date")

error:

Syntax error at or near 'seconds': extra input 'seconds'(line 1, pos 22)

== SQL == INTERVAL add_value seconds ----------------------^^^

How should I use the column's name?

Upvotes: 2

Views: 1222

Answers (2)

blackbishop
blackbishop

Reputation: 32700

To create intervals using existing columns you can concatenate 'interval' + add_value + 'seconds' and cast the resulting string into INTERVAL type. Then you can add it to timestamp start_date:

from pyspark.sql import functions as F

df.withColumn(
    'end_date',
    F.col('start_date') + F.format_string('interval %s seconds', 'add_value').cast('interval')
).show()

# +-----------------------+---------+-----------------------+
# |start_date             |add_value|end_date               |
# +-----------------------+---------+-----------------------+
# |2022-09-05 00:00:01.394|6.001    |2022-09-05 00:00:07.395|
# +-----------------------+---------+-----------------------+

Upvotes: 3

samkart
samkart

Reputation: 6654

INTERVAL doesn't let you use values from a column. But an easy approach can be that you cast the timestamp as double and then add the decimal seconds. Convert the resulting double value to a timestamp to get the desired format.

spark.sparkContext.parallelize([('2022-09-05 00:00:01.394', 6.001)]).toDF(['start_date', 'add_value']). \
    withColumn('start_date', func.to_timestamp('start_date')). \
    withColumn('start_date_added_value', 
               (func.col('start_date').cast('double') + func.col('add_value')).cast('timestamp')
               ). \
    show(truncate=False)

# +-----------------------+---------+-----------------------+
# |start_date             |add_value|start_date_added_value |
# +-----------------------+---------+-----------------------+
# |2022-09-05 00:00:01.394|6.001    |2022-09-05 00:00:07.395|
# +-----------------------+---------+-----------------------+

Upvotes: 1

Related Questions