Checking datetime format of a column in dataframe

Question

I have an input dateframe, which has below data:

id   date_column
1     2011-07-09 11:29:31+0000
2     2011-07-09T11:29:31+0000
3     2011-07-09T11:29:31
4     2011-07-09T11:29:31+0000

I want to check whether format of date_column matches the format "%Y-%m-%dT%H:%M:%S+0000", if format matches, i want to add a column, which has value 1 otherwise 0. Currently, i have defined a UDF to do this operation:

def date_pattern_matching(value, pattern):
    try:
        datetime.strptime(str(value),pattern)
        return "1"
    except:
        return "0"

It generates below output dataframe:

id   date_column                       output
1     2011-07-09 11:29:31+0000           0
2     2011-07-09T11:29:31+0000           1
3     2011-07-09T11:29:31                0
4     2011-07-09T11:29:31+0000           1

Execution through UDF takes a lot of time, is there an alternate way to achieve it?

johnmdonich · Accepted Answer

Try the regex pyspark.sql.Column.rlike operator with a when otherwise block

from pyspark.sql import functions as F
data = [[1, '2011-07-09 11:29:31+0000'], 
    [1,"2011-07-09 11:29:31+0000"], 
    [2,"2011-07-09T11:29:31+0000"],
    [3,"2011-07-09T11:29:31"],
    [4,"2011-07-09T11:29:31+0000"]]
df = spark.createDataFrame(data, ["id", "date_column"])


regex = "([0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}\+?\-?[0-9]{4})"
df_w_output = df.select("*", F.when(F.col("date_column").rlike(regex), 1).otherwise(0).alias("output"))
df_w_output.show()

Output
+---+------------------------+------+
|id |date_column             |output|
+---+------------------------+------+
|1  |2011-07-09 11:29:31+0000|0     |
|1  |2011-07-09 11:29:31+0000|0     |
|2  |2011-07-09T11:29:31+0000|1     |
|3  |2011-07-09T11:29:31     |0     |
|4  |2011-07-09T11:29:31+0000|1     |
+---+------------------------+------+

Checking datetime format of a column in dataframe

Answers (1)

Related Questions