novice8989

Reputation: 169

Add days to Timestamp column

I am trying to add days to a timestamp column, but I'm losing the hours from the original timestamp after the addition. Any pointers on how to retain the hours in the final output?

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.TimestampType
import spark.implicits._

var Df = Seq(
  (1, "2020-04-01 14:31:10", 0),
  (2, "2020-04-15 14:31:10", -1),
  (3, "2020-04-16 03:31:10", 3),
  (6, "2020-03-01 14:31:10", 30)
).toDF("Id", "startDt", "Flag")

Df = Df.withColumn("startDt", to_timestamp($"startDt"))

Df.withColumn(
  "newdate",
  when($"Flag".isNotNull && $"Flag" > 0 && $"Flag" <= 30 && $"startDt".isNotNull,
    expr("date_add(startDt, Flag)").cast(TimestampType)
  ).otherwise(lit(null))
).show()

Expected result:

+---+-------------------+----+-------------------+
| Id|            startDt|Flag|            newdate|
+---+-------------------+----+-------------------+
|  1|2020-04-01 14:31:10|   0|               null|
|  2|2020-04-15 14:31:10|  -1|               null|
|  3|2020-04-16 03:31:10|   3|2020-04-19 03:31:10|
|  6|2020-03-01 14:31:10|  30|2020-03-31 14:31:10|
+---+-------------------+----+-------------------+

Instead, given result:

+---+-------------------+----+-------------------+
| Id|            startDt|Flag|            newdate|
+---+-------------------+----+-------------------+
|  1|2020-04-01 14:31:10|   0|               null|
|  2|2020-04-15 14:31:10|  -1|               null|
|  3|2020-04-16 03:31:10|   3|2020-04-19 00:00:00|
|  6|2020-03-01 14:31:10|  30|2020-03-31 00:00:00|
+---+-------------------+----+-------------------+

I can get the desired output using a UDF, but is there any built-in function to achieve the same?

import java.sql.Timestamp
import java.util.concurrent.TimeUnit

val AddDaysToTimeStamp = udf((x: Timestamp, y: Int) =>
  new Timestamp(x.getTime() + TimeUnit.DAYS.toMillis(y))
)

Upvotes: 1

Views: 1249

Answers (2)

mck

Reputation: 42342

You can add an interval to the timestamp, where the interval is constructed from the Flag column:

import org.apache.spark.sql.types._

val df2 = Df.withColumn(
    "newdate", 
    when(
        $"Flag".isNotNull && $"Flag" > 0 && $"Flag" <= 30 && $"startDt".isNotNull, 
        $"startDt" + concat($"flag".cast("string"), lit(" days")).cast(CalendarIntervalType)
    )
)

df2.show
+---+-------------------+----+-------------------+
| Id|            startDt|Flag|            newdate|
+---+-------------------+----+-------------------+
|  1|2020-04-01 14:31:10|   0|               null|
|  2|2020-04-15 14:31:10|  -1|               null|
|  3|2020-04-16 03:31:10|   3|2020-04-19 03:31:10|
|  6|2020-03-01 14:31:10|  30|2020-03-31 14:31:10|
+---+-------------------+----+-------------------+

Upvotes: 1

Leo C

Reputation: 22439

"I'm losing the hours from the original timestamp column after adding days"

That's because date_add always returns a DateType (rather than a TimestampType), even though it accepts either type as input.
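This is easy to confirm from the result schema (a minimal sketch, assuming a local SparkSession is available; the column names here are just for illustration):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.DateType

val spark = SparkSession.builder.master("local[1]").appName("probe").getOrCreate()
import spark.implicits._

// date_add accepts a timestamp input, but its result type is DateType,
// so the 03:31:10 time-of-day component is silently dropped.
val probe = Seq("2020-04-16 03:31:10").toDF("ts")
  .withColumn("ts", to_timestamp($"ts"))
  .withColumn("added", date_add($"ts", 3))
```

Casting that DateType back to TimestampType, as in the question, can only produce midnight, since the time-of-day is already gone.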

A couple of approaches using the Spark API, if using a UDF isn't preferred:

Approach 1: Convert to Unix epoch time and modify in seconds (*)

Df.
  withColumn("startDt", to_timestamp($"startDt")).
  withColumn(
    "newdate",
    when($"Flag".isNotNull && $"Flag" > 0 && $"Flag" <= 30 && $"startDt".isNotNull,
      to_timestamp(unix_timestamp($"startDt") + $"Flag" * 3600 * 24)
    )
  ).show
// +---+-------------------+----+-------------------+
// | Id|            startDt|Flag|            newdate|
// +---+-------------------+----+-------------------+
// |  1|2020-04-01 14:31:10|   0|               null|
// |  2|2020-04-15 14:31:10|  -1|               null|
// |  3|2020-04-16 03:31:10|   3|2020-04-19 03:31:10|
// |  6|2020-03-01 14:31:10|  30|2020-03-31 15:31:10|  <-- +1 hour due to daylight savings
// +---+-------------------+----+-------------------+

(*) If the period of added/subtracted days crosses a daylight saving time boundary, the hour value will be shifted.

Approach 2: Split the datetime string and modify date & time separately

Df.
  withColumn("splitTS", split($"startDt", "\\s+")).
  withColumn(
    "newdate",
    when($"Flag".isNotNull && $"Flag" > 0 && $"Flag" <= 30,
      concat(expr("date_add(to_date(splitTS[0]), Flag)"), lit(" "), $"splitTS"(1))
    )
  ).show
// +---+-------------------+----+--------------------+-------------------+
// | Id|            startDt|Flag|             splitTS|            newdate|
// +---+-------------------+----+--------------------+-------------------+
// |  1|2020-04-01 14:31:10|   0|[2020-04-01, 14:3...|               null|
// |  2|2020-04-15 14:31:10|  -1|[2020-04-15, 14:3...|               null|
// |  3|2020-04-16 03:31:10|   3|[2020-04-16, 03:3...|2020-04-19 03:31:10|
// |  6|2020-03-01 14:31:10|  30|[2020-03-01, 14:3...|2020-03-31 14:31:10|
// +---+-------------------+----+--------------------+-------------------+

Note that validation of the startDt string (e.g. via a Regex timestamp pattern matching) could be added, if necessary.
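For example, such a check could look like this (a sketch with a simple shape-only pattern; it does not validate value ranges like month <= 12):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.master("local[1]").appName("validate").getOrCreate()
import spark.implicits._

// reject rows whose startDt doesn't look like "yyyy-MM-dd HH:mm:ss"
val tsPattern = "^\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}$"

val checked = Seq((1, "2020-04-16 03:31:10", 3), (2, "not a timestamp", 3))
  .toDF("Id", "startDt", "Flag")
  .withColumn("splitTS", split($"startDt", "\\s+"))
  .withColumn("newdate",
    when($"startDt".rlike(tsPattern) && $"Flag" > 0 && $"Flag" <= 30,
      concat(expr("date_add(to_date(splitTS[0]), Flag)"), lit(" "), $"splitTS"(1))))
```

Rows failing the pattern (Id 2 above) simply get a null newdate instead of producing a malformed concatenation.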

Upvotes: 1
