pyspark - why convert string(time) to timestamp return None?

Question

I have below sample pyspark dataframe, and want to extract the time from message column, and then convert the extract time to timestamp type.

message,class
"2022-10-28 07:46:59,705               one=1 Two=2 Three=3",classA
"2022-10-27 10:03:59,800               four=4 Five=5 Six=6",classB

I tried below 2 ways, but neither of them works.

way1:

sparkDF.withColumn("TIMESTAMP", to_timestamp(regexp_extract(col('message'), '(\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2},\d+)', 1),"MM-dd-yyyy HH:mm:ss.SSSS"))

way2:

sparkDF.withColumn("TIMESTAMP", regexp_extract(col('message'), '(\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2},\d+)', 1).cast("timestamp")

can anyone please help take a look?

samkart · Accepted Answer

you could split the message field by " " (multiple spaces) and then extract the first element. the extracted element can be converted to a timestamp easily.

see example

data_sdf. \
    withColumn('ts', func.to_timestamp(func.split('message', '  ')[0], 'yyyy-MM-dd HH:mm:ss,SSS')). \
    show(truncate=False)

# +---------------------------------------------------------+------+-----------------------+
# |message                                                  |class |ts                     |
# +---------------------------------------------------------+------+-----------------------+
# |2022-10-28 07:46:59,705               one=1 Two=2 Three=3|classA|2022-10-28 07:46:59.705|
# |2022-10-27 10:03:59,800               four=4 Five=5 Six=6|classB|2022-10-27 10:03:59.8  |
# +---------------------------------------------------------+------+-----------------------+

# root
#  |-- message: string (nullable = true)
#  |-- class: string (nullable = true)
#  |-- ts: timestamp (nullable = true)

pyspark - why convert string(time) to timestamp return None?

Answers (1)

Related Questions