Anand Hemmige

Reputation: 113

pyspark to_timestamp does not include milliseconds

I'm trying to format my timestamp column to include milliseconds without success. How can I format my time to look like this - 2019-01-04 11:09:21.152 ?

I have looked at the documentation and am following SimpleDateFormat, which the pyspark docs say is used by the to_timestamp function.

This is my dataframe.

+--------------------------+
|updated_date              |
+--------------------------+
|2019-01-04 11:09:21.152815|
+--------------------------+

I use the millisecond pattern without any success, as shown below:

>>> df.select('updated_date').withColumn("updated_date_col2", 
to_timestamp("updated_date", "YYYY-MM-dd HH:mm:ss:SSS")).show(1,False)
+--------------------------+-------------------+
|updated_date              |updated_date_col2  |
+--------------------------+-------------------+
|2019-01-04 11:09:21.152815|2019-01-04 11:09:21|
+--------------------------+-------------------+

I expect updated_date_col2 to be formatted as 2019-01-04 11:09:21.152

Upvotes: 6

Views: 9943

Answers (4)

MrBigData

Reputation: 3

You can cast the field directly and it will keep the milliseconds.

from pyspark.sql.functions import col
from pyspark.sql.types import TimestampType

df = df.withColumn('datetime', col('timestamp').cast(TimestampType()))

Upvotes: 0

iwataka

Reputation: 51

I think you can use a UDF and Python's standard datetime module, as below.

import datetime
from pyspark.sql.functions import udf
from pyspark.sql.types import TimestampType

def _to_timestamp(s):
    return datetime.datetime.strptime(s, '%Y-%m-%d %H:%M:%S.%f')

udf_to_timestamp = udf(_to_timestamp, TimestampType())

df.select('updated_date').withColumn("updated_date_col2", udf_to_timestamp("updated_date")).show(1,False)

Upvotes: 4

Adeeb

Reputation: 51

Reason: pyspark's to_timestamp parses only up to seconds, while TimestampType can hold milliseconds.

The following workaround may work:

If the timestamp pattern contains S, invoke a UDF to build the string 'INTERVAL <ms> MILLISECONDS' to use in an expression.

ts_pattern = "YYYY-MM-dd HH:mm:ss:SSS"
my_col_name = "time_with_ms"

# get the time down to seconds
df = df.withColumn(my_col_name, to_timestamp(df["updated_date_col2"], ts_pattern))

# add the milliseconds as an interval
if 'S' in ts_pattern:
   df = df.withColumn(my_col_name, df[my_col_name] + expr("INTERVAL 256 MILLISECONDS"))

To build a string like INTERVAL 256 MILLISECONDS, we may use a Java UDF:

df = df.withColumn(my_col_name, df[my_col_name] + expr(getIntervalStringUDF(df[my_col_name], ts_pattern)))

Inside the UDF getIntervalStringUDF(String timeString, String pattern):

  1. Use SimpleDateFormat to parse the date according to the pattern
  2. Return the formatted date as a string using the pattern "'INTERVAL 'SSS' MILLISECONDS'"
  3. Return 'INTERVAL 0 MILLISECONDS' on parse/format exceptions
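The same idea can be sketched in Python rather than Java (a hypothetical helper, not part of the answer's code; the '%f' pattern assumes fractional-second input like the question's):

```python
from datetime import datetime

def get_interval_string(time_string, pattern="%Y-%m-%d %H:%M:%S.%f"):
    """Return an 'INTERVAL <ms> MILLISECONDS' string for the fractional
    part of time_string, or 'INTERVAL 0 MILLISECONDS' if it does not parse."""
    try:
        parsed = datetime.strptime(time_string, pattern)
        ms = parsed.microsecond // 1000  # keep only the millisecond part
        return "INTERVAL {} MILLISECONDS".format(ms)
    except ValueError:
        return "INTERVAL 0 MILLISECONDS"
```

The returned string would then be passed to expr(...) and added to the seconds-precision column, as in the snippet above.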

Upvotes: 1

JunL22

Reputation: 21

This is not a solution with to_timestamp, but you can easily keep your column in a timestamp format.

The following code is one example of converting a numeric Unix timestamp to a timestamp column.

from datetime import datetime

ms = datetime.now().timestamp() # ex) ms = 1547521021.83301
df = spark.createDataFrame([(1, ms)], ['obs', 'time'])
df = df.withColumn('time', df.time.cast("timestamp"))
df.show(1, False) 

+---+--------------------------+
|obs|time                      |
+---+--------------------------+
|1  |2019-01-15 12:15:49.565263|
+---+--------------------------+

Note that new Date().getTime() or Date.now() in JS give milliseconds since the epoch, while datetime.datetime.now().timestamp() in Python gives seconds with a fractional part; Spark's timestamp cast expects the latter.
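A minimal sketch of handling a JS-style millisecond value before casting (the helper name is illustrative; it divides by 1000 because the cast expects seconds):

```python
from datetime import datetime, timezone

def ms_to_datetime(ms):
    """Convert epoch milliseconds (as from JS Date.now()) to a UTC datetime,
    keeping the sub-second part."""
    return datetime.fromtimestamp(ms / 1000.0, tz=timezone.utc)
```

The same division applies inside Spark, e.g. casting (df.time / 1000) instead of df.time when the source column holds milliseconds.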

Upvotes: 2
