Reputation: 169
I am trying to add days to timestamp column but I'm losing the hours from the original timestamp column after adding days. Any pointers on how to retain the hours in the final output?
var Df = Seq(
  (1, "2020-04-01 14:31:10", 0),
  (2, "2020-04-15 14:31:10", -1),
  (3, "2020-04-16 03:31:10", 3),
  (6, "2020-03-01 14:31:10", 30)
).toDF("Id", "startDt", "Flag")

Df = Df.withColumn("startDt", to_timestamp($"startDt"))

Df.withColumn("newdate",
  when($"Flag".isNotNull && $"Flag" > 0 && $"Flag" <= 30 && $"startDt".isNotNull,
    expr("date_add(startDt, Flag)").cast(TimestampType)
  ).otherwise(lit(null))
).show()
Expected result:
+---+-------------------+----+-------------------+
| Id| startDt|Flag| newdate|
+---+-------------------+----+-------------------+
| 1|2020-04-01 14:31:10| 0| null|
| 2|2020-04-15 14:31:10| -1| null|
| 3|2020-04-16 03:31:10| 3|2020-04-19 03:31:10|
| 6|2020-03-01 14:31:10| 30|2020-03-31 14:31:10|
+---+-------------------+----+-------------------+
Instead, given result:
+---+-------------------+----+-------------------+
| Id| startDt|Flag| newdate|
+---+-------------------+----+-------------------+
| 1|2020-04-01 14:31:10| 0| null|
| 2|2020-04-15 14:31:10| -1| null|
| 3|2020-04-16 03:31:10| 3|2020-04-19 00:00:00|
| 6|2020-03-01 14:31:10| 30|2020-03-31 00:00:00|
+---+-------------------+----+-------------------+
I can get the desired output using a UDF, but is there any built-in function to achieve the same?
import java.sql.Timestamp
import java.util.concurrent.TimeUnit

val AddDaysToTimeStamp = udf((x: java.sql.Timestamp, y: Int) =>
  new Timestamp(x.getTime() + TimeUnit.DAYS.toMillis(y))
)
Upvotes: 1
Views: 1249
Reputation: 42342
You can add an interval to the timestamp, where the interval is constructed from the Flag column:
import org.apache.spark.sql.types._
val df2 = Df.withColumn(
"newdate",
when(
$"Flag".isNotNull && $"Flag" > 0 && $"Flag" <= 30 && $"startDt".isNotNull,
    $"startDt" + concat($"Flag".cast("string"), lit(" days")).cast(CalendarIntervalType)
)
)
df2.show
+---+-------------------+----+-------------------+
| Id| startDt|Flag| newdate|
+---+-------------------+----+-------------------+
| 1|2020-04-01 14:31:10| 0| null|
| 2|2020-04-15 14:31:10| -1| null|
| 3|2020-04-16 03:31:10| 3|2020-04-19 03:31:10|
| 6|2020-03-01 14:31:10| 30|2020-03-31 14:31:10|
+---+-------------------+----+-------------------+
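On Spark 3.0+, a sketch of an equivalent approach (an assumption, not part of the original answer) is to build the interval with the built-in make_interval SQL function instead of casting a concatenated string:

```scala
import org.apache.spark.sql.functions._

// Sketch, assuming Spark 3.0+: make_interval(years, months, weeks, days, hours, mins, secs)
// builds the interval directly from the Flag column, avoiding the string concat/cast.
val df3 = Df.withColumn(
  "newdate",
  when(
    $"Flag".isNotNull && $"Flag" > 0 && $"Flag" <= 30 && $"startDt".isNotNull,
    expr("startDt + make_interval(0, 0, 0, Flag)")
  )
)
```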
Upvotes: 1
Reputation: 22439
I'm losing the hours from the original timestamp column after adding days
That's because date_add always returns DateType (rather than TimestampType), even though it takes either type as input.
A couple of approaches using the Spark API, if using a UDF isn't preferred:
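To confirm the DateType behavior, a quick sketch is to inspect the schema of the date_add result (the column name d below is just for illustration):

```scala
// date_add on a timestamp column still yields DateType,
// which is why the time-of-day is truncated to 00:00:00.
Df.withColumn("d", expr("date_add(to_timestamp(startDt), 1)")).printSchema
// the d column is reported as date, not timestamp
```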
Approach 1: Convert to Unix epoch time and modify in seconds (*)
Df.
withColumn("startDt", to_timestamp($"startDt")).
withColumn(
"newdate",
when($"Flag".isNotNull && $"Flag" > 0 && $"Flag" <= 30 && $"startDt".isNotNull,
to_timestamp(unix_timestamp($"startDt") + $"Flag" * 3600 * 24)
)
).show
// +---+-------------------+----+-------------------+
// | Id| startDt|Flag| newdate|
// +---+-------------------+----+-------------------+
// | 1|2020-04-01 14:31:10| 0| null|
// | 2|2020-04-15 14:31:10| -1| null|
// | 3|2020-04-16 03:31:10| 3|2020-04-19 03:31:10|
// | 6|2020-03-01 14:31:10| 30|2020-03-31 15:31:10| <-- +1 hour due to daylight savings
// +---+-------------------+----+-------------------+
(*) If the period of the added/subtracted days crosses a Daylight saving time, the hour value will be shifted.
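If the wall-clock time must be preserved regardless of DST, one possible workaround (a sketch, an assumption rather than part of the original answer) is to pin the session time zone to a fixed-offset zone before doing the epoch arithmetic:

```scala
// With a fixed-offset zone such as UTC, adding Flag * 86400 seconds never
// crosses a DST transition, so hours/minutes/seconds are preserved exactly.
spark.conf.set("spark.sql.session.timeZone", "UTC")
```

Note this changes how all timestamps in the session are parsed and displayed, so it is only appropriate if the data is meant to be interpreted in that zone.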
Approach 2: Split the datetime string and modify date & time separately
Df.
withColumn("splitTS", split($"startDt", "\\s+")).
withColumn(
"newdate",
when($"Flag".isNotNull && $"Flag" > 0 && $"Flag" <= 30,
concat(expr("date_add(to_date(splitTS[0]), Flag)"), lit(" "), $"splitTS"(1))
)
).show
// +---+-------------------+----+--------------------+-------------------+
// | Id| startDt|Flag| splitTS| newdate|
// +---+-------------------+----+--------------------+-------------------+
// | 1|2020-04-01 14:31:10| 0|[2020-04-01, 14:3...| null|
// | 2|2020-04-15 14:31:10| -1|[2020-04-15, 14:3...| null|
// | 3|2020-04-16 03:31:10| 3|[2020-04-16, 03:3...|2020-04-19 03:31:10|
// | 6|2020-03-01 14:31:10| 30|[2020-03-01, 14:3...|2020-03-31 14:31:10|
// +---+-------------------+----+--------------------+-------------------+
Note that validation of the startDt string (e.g. via regex timestamp pattern matching) could be added if necessary.
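For instance, a minimal validation sketch (the pattern and the startDtValid column name below are assumptions for illustration):

```scala
// Flag rows whose startDt does not match a yyyy-MM-dd HH:mm:ss shape;
// rlike applies the Java regex to each row of the string column.
val tsPattern = "^\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}$"
val validated = Df.withColumn("startDtValid", $"startDt".rlike(tsPattern))
```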
Upvotes: 1