Reputation: 125
The format of most of my (string) dates in my pyspark column looks like this: Thu Jul 01 15:32:02 +0000 2021
I want to convert it into a date format like this: 01-07-2021 And I have found a way that works, unfortunately only when the column is clean aka when the string has the follwing format:'%a %b %d %H:%M:%S +0000 %Y'
This is the code I used:
from datetime import datetime
import pytz
from pyspark.sql.functions import udf, to_date, to_utc_timestamp
# Converting date string format
def getDate(x):
if x is not None:
return str(datetime.strptime(x,'%a %b %d %H:%M:%S +0000 %Y').replace(tzinfo=pytz.UTC).strftime("%Y-%m-%d %H:%M:%S"))
else:
return None
# UDF declaration
date_fn = udf(getDate, StringType())
# Converting datatype in spark dataframe
df = df.withColumn("date", to_utc_timestamp(date_fn("date"),"UTC"))
Is there some way I can add some code that detects non matching formats and then just delets the observation or turns it into null?
Thank you!
Upvotes: 1
Views: 1435
Reputation: 14845
Using to_date converts a string into a date using a given format. If the string does not match the format, the result will be null
.
There is a small restriction that to_date
cannot parse the day of the week:
Symbols of ‘E’, ‘F’, ‘q’ and ‘Q’ can only be used for datetime formatting, e.g. date_format. They are not allowed used for datetime parsing, e.g. to_timestamp.
The easiest way is to remove the first three characters from the date string before using to_date
:
data = [["Thu Jul 01 15:32:02 +0000 2021"],
["Thu Jul 01 15:32:02 +0200 2021"],
["Thu Jul 01 15:32:02 2021"],
["2021-07-01 15:32:02"],
["this is not a valid time"]]
df = spark.createDataFrame(data, schema=["input"])
df.withColumn("date", F.to_date(F.substring("input",5,100),
"MMM dd HH:mm:ss xx yyyy")).show(truncate=False)
Output:
data = [["Thu Jul 01 15:32:02 +0000 2021"],...
+------------------------------+----------+
|input |date |
+------------------------------+----------+
|Thu Jul 01 15:32:02 +0000 2021|2021-07-01|
|Thu Jul 01 15:32:02 +0200 2021|2021-07-01|
|Thu Jul 01 15:32:02 2021 |null |
|2021-07-01 15:32:02 |null |
|this is not a valid time |null |
+------------------------------+----------+
Upvotes: 1