DBA108642
DBA108642

Reputation: 2112

Pyspark handle multiple datetime formats when casting from string to timestamp

I have the following sample dataframe:

+------------------+-----------+
|order_completed_at|static_date|
+------------------+-----------+
|6/16/2021 21:29   |2021-10-10 |
|6/7/2021 9:29     |2021-10-10 |
|6/12/2021 15:35   |2021-10-10 |
|6/18/2021 22:25   |2021-10-10 |
|6/16/2021 5:25    |2021-10-10 |
+------------------+-----------+

where both fields are of type string. I need to cast these to timestamps, which I can do with the following code:

from pyspark.sql import functions as sql_functions

order_dates = order_dates.withColumn("order_completed_at_test", sql_functions.when() sql_functions.unix_timestamp(
        sql_functions.col('order_completed_at'), "MM/dd/yyyy").cast("timestamp"))

order_dates = order_dates.withColumn("static_date_test", sql_functions.to_timestamp("static_date"))

However, within the order_completed_at column, there can be a mix of formats, such that it could be MM/dd/yyyy or yyyy-MM-dd

Is it possible to write one expression that can interpret both date time formats?

EDIT:

I'm going to close this question as it's a very slippery slope to go down this road dealing with dynamic date formats, as both answers have noted. I will be asking the client to make changes to the source data

Upvotes: 4

Views: 898

Answers (2)

Sanket9394
Sanket9394

Reputation: 2091

If you are completely sure that there just 2 formats then you can use where otherwise .

Example :

from pyspark.sql.functions import col
from pyspark.sql.functions import when

order_dates.withColumn("datetest", \
    when(col("order_completed_at").rlike("\d+/\d+/\d+"), #YOUR_TIMESTAMP_CAST_OP_FOR_MM/dd/yyyy )\
    .otherwise(#YOUR_TIMESTAMP_CAST_OP_FOR_yyyy-MM-dd))

Upvotes: 5

Thomas Kimber
Thomas Kimber

Reputation: 11107

This looks like a candidate for a udf (e.g. https://spark.apache.org/docs/2.4.6/api/python/_modules/pyspark/sql/udf.html) functions wrapper around the python dateparser module.

Essentially, you want to be able to apply a function call along the lines of :

dateparser.parse("order_completed_at")

as part of your sql:

However - whilst this will hoover up all your dates, there is no guarantee that a d/m/y date wont be misinterpreted as an m/d/y date, or vice versa. Mixing date formats is problematic, because they are inherently ambiguous. Ideally, you'd parse if there's only a single interpretation, but flag as problematic any that could be interpreted in one of a number of ways.

The problem as presented (mix of y-m-d and m/d/y dates) is probably ok on this front though. As always, best to be cautious.

Upvotes: 3

Related Questions