Reputation: 389
message,class "2022-10-28 07:46:59,705 one=1 Two=2 Three=3",classA "2022-10-27 10:03:59,800 four=4 Five=5 Six=6",classB
way1:
sparkDF.withColumn("TIMESTAMP", to_timestamp(regexp_extract(col('message'), '(\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2},\d+)', 1),"MM-dd-yyyy HH:mm:ss.SSSS"))
way2:
sparkDF.withColumn("TIMESTAMP", regexp_extract(col('message'), '(\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2},\d+)', 1).cast("timestamp")
can anyone please help take a look?
Upvotes: 0
Views: 209
Reputation: 6644
you could split the message
field by " "
(multiple spaces) and then extract the first element. the extracted element can be converted to a timestamp easily.
see example
data_sdf. \
withColumn('ts', func.to_timestamp(func.split('message', ' ')[0], 'yyyy-MM-dd HH:mm:ss,SSS')). \
show(truncate=False)
# +---------------------------------------------------------+------+-----------------------+
# |message |class |ts |
# +---------------------------------------------------------+------+-----------------------+
# |2022-10-28 07:46:59,705 one=1 Two=2 Three=3|classA|2022-10-28 07:46:59.705|
# |2022-10-27 10:03:59,800 four=4 Five=5 Six=6|classB|2022-10-27 10:03:59.8 |
# +---------------------------------------------------------+------+-----------------------+
# root
# |-- message: string (nullable = true)
# |-- class: string (nullable = true)
# |-- ts: timestamp (nullable = true)
Upvotes: 1