Tad

Reputation: 913

PySpark - convert a column with two timestamp formats to a common time format

Column 'dat' contains two types of timestamp: a datetime string and a unix epoch. I'm trying to cast these string date formats to a single common format.

from pyspark.sql.types import StructType, StructField, StringType
from pyspark.sql.functions import from_unixtime

#Sample data

data1 = [("host1","cpu","2020-03-23 07:30:20"),
       ("host2","memory","1616131516"),
       ("host3","disk","2020-03-23 08:50:00"),
       ("host4","memory","1816131316"),
         ]

#Defining Schema

schema1 = StructType([
    StructField("hostname", StringType(), True),
    StructField("kpi", StringType(), True),
    StructField("dat", StringType(), True),
])

#Creating dataframe

df = spark.createDataFrame(data=data1, schema=schema1)
df.printSchema()
df.show(truncate=False)

root
 |-- hostname: string (nullable = true)
 |-- kpi: string (nullable = true)
 |-- dat: string (nullable = true)

+--------+------+-------------------+
|hostname|kpi   |dat                |
+--------+------+-------------------+
|host1   |cpu   |2020-03-23 07:30:20|
|host2   |memory|1616131516         |
|host3   |disk  |2020-03-23 08:50:00|
|host4   |memory|1816131316         |
+--------+------+-------------------+

I have code that converts only the unixtime format. I need to convert both formats of column 'dat' to the desired format "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'" in a single line of code, as I'm using data streaming.

df1 = df.withColumn('datetime', from_unixtime(df.dat, "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"))
df1.show(truncate=False)
+--------+------+-------------------+------------------------+
|hostname|kpi   |dat                |datetime                |
+--------+------+-------------------+------------------------+
|host1   |cpu   |2020-03-23 07:30:20|null                    |
|host2   |memory|1616131516         |2021-03-19T05:25:16.000Z|
|host3   |disk  |2020-03-23 08:50:00|null                    |
|host4   |memory|1816131316         |2027-07-21T00:55:16.000Z|
+--------+------+-------------------+------------------------+

My desired dataframe is :

+--------+------+-------------------+------------------------+
|hostname|kpi   |dat                |datetime                |
+--------+------+-------------------+------------------------+
|host1   |cpu   |2020-03-23 07:30:20|2020-03-23T07:30:20.000Z|
|host2   |memory|1616131516         |2021-03-19T05:25:16.000Z|
|host3   |disk  |2020-03-23 08:50:00|2020-03-23T08:50:00.000Z|
|host4   |memory|1816131316         |2027-07-21T00:55:16.000Z|
+--------+------+-------------------+------------------------+

Upvotes: 1

Views: 365

Answers (1)

mck

Reputation: 42392

You can use date_format to convert the other, "standard" date format to the desired format, and coalesce that with your existing from_unixtime conversion. This works because from_unixtime returns null for the datetime strings and date_format returns null for the epoch values, so coalesce keeps whichever conversion succeeded for each row.

import pyspark.sql.functions as F

df1 = df.withColumn(
    'datetime',
    F.coalesce(
        F.from_unixtime(df.dat, "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"), 
        F.date_format(df.dat, "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'")
    )
)

df1.show(truncate=False)
+--------+------+-------------------+------------------------+
|hostname|kpi   |dat                |datetime                |
+--------+------+-------------------+------------------------+
|host1   |cpu   |2020-03-23 07:30:20|2020-03-23T07:30:20.000Z|
|host2   |memory|1616131516         |2021-03-19T05:25:16.000Z|
|host3   |disk  |2020-03-23 08:50:00|2020-03-23T08:50:00.000Z|
|host4   |memory|1816131316         |2027-07-21T00:55:16.000Z|
+--------+------+-------------------+------------------------+
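Since the question mentions streaming: the same coalesce expression can be applied unchanged to a streaming DataFrame, because it is a plain per-row projection with no state. Below is a minimal sketch assuming a hypothetical socket source on localhost:9999 that emits comma-separated hostname,kpi,dat lines; the source, host, port, and parsing step are illustrative assumptions, not part of the original question.

import pyspark.sql.functions as F

# Hypothetical streaming source; the socket source yields a single 'value' column
stream_df = (spark.readStream
    .format("socket")
    .option("host", "localhost")
    .option("port", 9999)
    .load())

# Illustrative parsing: split "hostname,kpi,dat" lines into three columns
parts = F.split(stream_df.value, ",")
parsed = stream_df.select(
    parts.getItem(0).alias("hostname"),
    parts.getItem(1).alias("kpi"),
    parts.getItem(2).alias("dat"),
)

# The same single expression as in the batch answer above
result = parsed.withColumn(
    "datetime",
    F.coalesce(
        F.from_unixtime(F.col("dat"), "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"),
        F.date_format(F.col("dat"), "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"),
    ),
)

query = result.writeStream.format("console").option("truncate", "false").start()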

Upvotes: 1
