Reputation: 2112
I have the following sample dataframe:
+------------------+-----------+
|order_completed_at|static_date|
+------------------+-----------+
|6/16/2021 21:29 |2021-10-10 |
|6/7/2021 9:29 |2021-10-10 |
|6/12/2021 15:35 |2021-10-10 |
|6/18/2021 22:25 |2021-10-10 |
|6/16/2021 5:25 |2021-10-10 |
+------------------+-----------+
where both fields are of type string. I need to cast these to timestamps, which I can do with the following code:
from pyspark.sql import functions as sql_functions
order_dates = order_dates.withColumn("order_completed_at_test", sql_functions.when() sql_functions.unix_timestamp(
sql_functions.col('order_completed_at'), "MM/dd/yyyy").cast("timestamp"))
order_dates = order_dates.withColumn("static_date_test", sql_functions.to_timestamp("static_date"))
However, within the order_completed_at
column, there can be a mix of formats, such that it could be MM/dd/yyyy
or yyyy-MM-dd
Is it possible to write one expression that can interpret both date time formats?
EDIT:
I'm going to close this question as it's a very slippery slope to go down this road dealing with dynamic date formats, as both answers have noted. I will be asking the client to make changes to the source data
Upvotes: 4
Views: 898
Reputation: 2091
If you are completely sure that there just 2 formats then you can use where otherwise
.
Example :
from pyspark.sql.functions import col
from pyspark.sql.functions import when
order_dates.withColumn("datetest", \
when(col("order_completed_at").rlike("\d+/\d+/\d+"), #YOUR_TIMESTAMP_CAST_OP_FOR_MM/dd/yyyy )\
.otherwise(#YOUR_TIMESTAMP_CAST_OP_FOR_yyyy-MM-dd))
Upvotes: 5
Reputation: 11107
This looks like a candidate for a udf (e.g. https://spark.apache.org/docs/2.4.6/api/python/_modules/pyspark/sql/udf.html) functions wrapper around the python dateparser module.
Essentially, you want to be able to apply a function call along the lines of :
dateparser.parse("order_completed_at")
as part of your sql:
However - whilst this will hoover up all your dates, there is no guarantee that a d/m/y date wont be misinterpreted as an m/d/y date, or vice versa. Mixing date formats is problematic, because they are inherently ambiguous. Ideally, you'd parse if there's only a single interpretation, but flag as problematic any that could be interpreted in one of a number of ways.
The problem as presented (mix of y-m-d and m/d/y dates) is probably ok on this front though. As always, best to be cautious.
Upvotes: 3