Reputation: 913
The column 'dat' contains two different kinds of timestamps. I'm trying to cast multiple string date formats to a single format.
from pyspark.sql.types import StructType, StructField, StringType
from pyspark.sql.functions import from_unixtime
#Sample data
data1 = [("host1","cpu","2020-03-23 07:30:20"),
("host2","memory","1616131516"),
("host3","disk","2020-03-23 08:50:00"),
("host4","memory","1816131316"),
]
#Defining Schema
schema1= StructType([ \
StructField("hostname",StringType(),True), \
StructField("kpi",StringType(),True), \
StructField("dat",StringType(),True), \
])
#Creating dataframe
df = spark.createDataFrame(data=data1,schema=schema1)
df.printSchema()
df.show(truncate=False)
root
|-- hostname: string (nullable = true)
|-- kpi: string (nullable = true)
|-- dat: string (nullable = true)
+--------+------+-------------------+
|hostname|kpi |dat |
+--------+------+-------------------+
|host1 |cpu |2020-03-23 07:30:20|
|host2 |memory|1616131516 |
|host3 |disk |2020-03-23 08:50:00|
|host4 |memory|1816131316 |
+--------+------+-------------------+
I have code that converts only the unixtime format. I need to convert both formats of column 'dat' to the desired format "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'" in a single line of code, as I'm using data streaming.
df1 = df.withColumn('datetime', from_unixtime(df.dat, "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"))
df1.show(truncate=False)
+--------+------+-------------------+------------------------+
|hostname|kpi |dat |datetime |
+--------+------+-------------------+------------------------+
|host1 |cpu |2020-03-23 07:30:20|null |
|host2 |memory|1616131516 |2021-03-19T05:25:16.000Z|
|host3 |disk |2020-03-23 08:50:00|null |
|host4 |memory|1816131316 |2027-07-21T00:55:16.000Z|
+--------+------+-------------------+------------------------+
My desired dataframe is :
+--------+------+-------------------+------------------------+
|hostname|kpi |dat |datetime |
+--------+------+-------------------+------------------------+
|host1 |cpu |2020-03-23 07:30:20|2020-03-23T07:30:20.000Z|
|host2 |memory|1616131516 |2021-03-19T05:25:16.000Z|
|host3 |disk |2020-03-23 08:50:00|2020-03-23T08:50:00.000Z|
|host4 |memory|1816131316 |2027-07-21T00:55:16.000Z|
+--------+------+-------------------+------------------------+
Upvotes: 1
Views: 365
Reputation: 42392
You can use date_format to convert the other "standard" date format to the desired format, and coalesce it with your existing from_unixtime conversion. This works because from_unixtime returns null for the rows where 'dat' is a date string (it expects a numeric epoch), while date_format returns null for the rows where 'dat' is an epoch string (it can't be cast to a timestamp), so coalesce keeps whichever conversion succeeded for each row.
import pyspark.sql.functions as F

df1 = df.withColumn(
    'datetime',
    F.coalesce(
        # Handles epoch strings; returns null for date strings
        F.from_unixtime(df.dat, "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"),
        # Handles date strings; returns null for epoch strings
        F.date_format(df.dat, "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'")
    )
)
df1.show(truncate=False)
+--------+------+-------------------+------------------------+
|hostname|kpi |dat |datetime |
+--------+------+-------------------+------------------------+
|host1 |cpu |2020-03-23 07:30:20|2020-03-23T07:30:20.000Z|
|host2 |memory|1616131516 |2021-03-19T05:25:16.000Z|
|host3 |disk |2020-03-23 08:50:00|2020-03-23T08:50:00.000Z|
|host4 |memory|1816131316 |2027-07-21T00:55:16.000Z|
+--------+------+-------------------+------------------------+
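Since you mentioned streaming: coalesce, from_unixtime, and date_format are all per-row expressions, so the same withColumn call works unchanged on a streaming DataFrame. A minimal sketch, assuming a JSON file source and a console sink (the input path and sink here are placeholders for illustration, not from your post):

import pyspark.sql.functions as F

# Hypothetical streaming source; reuses schema1 from the question.
stream_df = (spark.readStream
    .schema(schema1)
    .json("/path/to/input"))   # placeholder input directory

stream_out = stream_df.withColumn(
    'datetime',
    F.coalesce(
        F.from_unixtime(F.col('dat'), "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"),
        F.date_format(F.col('dat'), "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'")
    )
)

# Console sink for demonstration; swap in your real sink.
query = (stream_out.writeStream
    .format("console")
    .outputMode("append")
    .start())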
Upvotes: 1