Reputation: 382
My data has inconsistent fractional-second precision, and I am parsing it with the function below:
to_timestamp("col", "yyyy-MM-dd'T'hh:mm:ss.SSS'Z'")
Data:
time (input) | OUTPUT (actual) | IDEAL (expected)
2022-06-16T07:01:25.346Z | 2022-06-16T07:01:25.346+0000 | 2022-06-16T07:01:25.346+0000
2022-06-16T06:54:21.51Z | 2022-06-16T06:54:21.051+0000 | 2022-06-16T06:54:21.510+0000
2022-06-16T06:54:21.5Z | 2022-06-16T06:54:21.005+0000 | 2022-06-16T06:54:21.500+0000
So the fractional seconds in my data come in S, SS, or SSS precision.
How can I normalise them to SSS the correct way?
Here, .51 means 510 milliseconds, not 051.
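Just to illustrate that positional reading, plain Python's strptime treats the fraction the same way:
from datetime import datetime
# %f accepts 1-6 fractional digits and zero-pads on the right,
# so ".51" is read as 510000 microseconds, i.e. 510 ms.
datetime.strptime("2022-06-16T06:54:21.51", "%Y-%m-%dT%H:%M:%S.%f")
# -> datetime.datetime(2022, 6, 16, 6, 54, 21, 510000)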
Using Spark version: 3.2.1. Code:
import pyspark.sql.functions as F

test = spark.createDataFrame(
    [(1, '2022-06-16T07:01:25.346Z'), (2, '2022-06-16T06:54:21.51Z'), (3, '2022-06-16T06:54:21.5Z')],
    ['no', 'timing1'],
)
timeFmt = "yyyy-MM-dd'T'hh:mm:ss.SSS'Z'"
test = test.withColumn("timing2", F.to_timestamp(F.col('timing1'), format=timeFmt))
test.select("timing1", "timing2").show(truncate=False)
Output: see the OUTPUT column in the table above.
Upvotes: 1
Views: 382
Reputation: 382
I was using this setting:
spark.sql("set spark.sql.legacy.timeParserPolicy=LEGACY")
After resetting it, parsing works as expected.
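A minimal sketch of reverting it (assuming you want EXCEPTION, the Spark 3.x default, back; CORRECTED also switches to the new Java-time parser):
# Remove the session-level override ...
spark.conf.unset("spark.sql.legacy.timeParserPolicy")
# ... or set the policy back explicitly:
spark.conf.set("spark.sql.legacy.timeParserPolicy", "EXCEPTION")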
Upvotes: 1
Reputation: 2416
I also use v3.2.1, and it works for me if you just don't pass a format to to_timestamp — the string is already in a format Spark parses by default:
from pyspark.sql import functions as F

test = spark.createDataFrame(
    [(1, '2022-06-16T07:01:25.346Z'), (2, '2022-06-16T06:54:21.51Z'), (3, '2022-06-16T06:54:21.5Z')],
    ['no', 'timing1'],
)
new_df = test.withColumn('timing1_ts', F.to_timestamp('timing1'))
new_df.show(truncate=False)
new_df.dtypes
+---+------------------------+-----------------------+
|no |timing1 |timing1_ts |
+---+------------------------+-----------------------+
|1 |2022-06-16T07:01:25.346Z|2022-06-16 07:01:25.346|
|2 |2022-06-16T06:54:21.51Z |2022-06-16 06:54:21.51 |
|3 |2022-06-16T06:54:21.5Z |2022-06-16 06:54:21.5 |
+---+------------------------+-----------------------+
Out[9]: [('no', 'bigint'), ('timing1', 'string'), ('timing1_ts', 'timestamp')]
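If you then need strings with exactly three fractional digits, date_format pads the milliseconds back out to SSS (a sketch; timing1_norm is just an illustrative column name):
# Format the parsed timestamp back to text with zero-padded milliseconds.
new_df = new_df.withColumn(
    'timing1_norm',
    F.date_format('timing1_ts', "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'")
)
new_df.select('timing1', 'timing1_norm').show(truncate=False)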
Upvotes: 1