thentangler
thentangler

Reputation: 1256

How to stop timestamp in pyspark from dropping trailing zeroes

I have Spark dataframe in where the Timestamp is in milliseconds.

+-----------------------+
|CALC_TS                |
+-----------------------+
|2021-01-27 01:35:05.043|
|2021-01-27 01:35:05.043|    
|2021-01-27 01:35:05.043|

I want to make it show microseconds like so:

+--------------------------+
|CALC_TS                   |
+--------------------------+
|2021-01-27 01:35:05.043000|
|2021-01-27 01:35:05.043000|
|2021-01-27 01:35:05.043000|

So basically I would like the milliseconds portion to show in terms of microseconds. In the above example, the 43 milliseconds from the 1st dataframe would be 43 thousand microseconds as shown in the seconds dataframe.

I have tried:

df.withColumn('TIME', to_timestamp('CALC_TS', 'yyyy-MM-dd HH:mm:ss.SSSSSS'))

and

df.withColumn('TIME', col('CALC_TS').cast("timestamp"))

But they are giving the same result and stripping the last 3 zeroes. Is there a way to achieve this?

Upvotes: 2

Views: 1889

Answers (3)

Cena
Cena

Reputation: 3419

You can use rpad.

Right pad with trailing zeros upto the expected length of your timestamp. In your case, a length of 26 characters (for format yyyy-MM-dd HH:mm:ss.SSSSSS)

from pyspark.sql.functions import *

df.withColumn('CALC_TS_1', col('CALC_TS').cast("timestamp"))\
    .withColumn('CALC_TS_1', rpad(col('CALC_TS_1').cast('string'),26,'0'))\
    .show(truncate=False)

+--------------------------+--------------------------+
|CALC_TS                   |CALC_TS_1                 |
+--------------------------+--------------------------+
|2021-01-27 01:35:05.043   |2021-01-27 01:35:05.043000|
|2021-01-27 01:35:05.043567|2021-01-27 01:35:05.043567|
+--------------------------+--------------------------+

Upvotes: 2

mazaneicha
mazaneicha

Reputation: 9427

to_timestamp(timestamp_str[,fmt]) accepts a string and returns a timestamp (type). If your CALC_TS is already a timestamp as you said, you should rather use df.withColumn('TIME', date_format('CALC_TS','yyyy-MM-dd HH:mm:ss.SSSSSS')) to format it to string, with microseconds precision. From Spark reference:

o Fraction: Use one or more (up to 9) contiguous 'S' characters, e,g SSSSSS, to parse and format fraction of second. For parsing, the acceptable fraction length can be [1, the number of contiguous ‘S’]. For formatting, the fraction length would be padded to the number of contiguous ‘S’ with zeros. Spark supports datetime of micro-of-second precision, which has up to 6 significant digits, but can parse nano-of-second with exceeded part truncated.

For Spark 2.4, and just to make it look like the precision of a timestamp field is microseconds, perhaps you can "fake" trailing zeroes while formatting it like this: date_format('CALC_TS','yyyy-MM-dd HH:mm:ss.SSS000')

Upvotes: 2

blackbishop
blackbishop

Reputation: 32660

If the columnCALC_TS is of type string, first convert to TimestampType using to_timestamp and unix_timestamp functions then using date_format you can format it with 6 fractions in milliseconds :

from pyspark.sql import functions as F

df.printSchema()

#root
# |-- CALC_TS: string (nullable = true)

df1 = df.withColumn(
   'TIME',
   F.to_timestamp(
       F.unix_timestamp('CALC_TS', "yyyy-MM-dd HH:mm:ss.SSS") # seconds
       + F.substring_index('CALC_TS', '.', -1).cast('float') / 1000 # milliseconds part
   )
).withColumn(
   "TIME_FORMAT",
   F.date_format("TIME", "yyyy-MM-dd HH:mm:ss.SSSSSS")
)

df1.show(truncate=False)

#+-----------------------+-----------------------+--------------------------+
#|CALC_TS                |TIME                   |TIME_FORMAT               |
#+-----------------------+-----------------------+--------------------------+
#|2021-01-27 01:35:05.043|2021-01-27 01:35:05.043|2021-01-27 01:35:05.000043|
#|2021-01-27 01:35:05.043|2021-01-27 01:35:05.043|2021-01-27 01:35:05.000043|
#|2021-01-27 01:35:05.043|2021-01-27 01:35:05.043|2021-01-27 01:35:05.000043|
#+-----------------------+-----------------------+--------------------------+

#root
# |-- CALC_TS: string (nullable = true)
# |-- TIME: timestamp (nullable = true)
# |-- TIME_FORMAT: string (nullable = true)

If the column is already of type timestamp, simply use date_format as in the above code.

Upvotes: 1

Related Questions