Reputation: 113
I'm trying to format my timestamp column to include milliseconds without success. How can I format my time to look like this - 2019-01-04 11:09:21.152
?
I have looked at the documentation and following the SimpleDataTimeFormat , which the pyspark docs say are being used by the to_timestamp
function.
This is my dataframe.
+--------------------------+
|updated_date |
+--------------------------+
|2019-01-04 11:09:21.152815|
+--------------------------+
I use the millisecond format without any success as below
>>> df.select('updated_date').withColumn("updated_date_col2",
to_timestamp("updated_date", "YYYY-MM-dd HH:mm:ss:SSS")).show(1,False)
+--------------------------+-------------------+
|updated_date |updated_date_col2 |
+--------------------------+-------------------+
|2019-01-04 11:09:21.152815|2019-01-04 11:09:21|
+--------------------------+-------------------+
I expect updated_date_col2
to be formatted as 2019-01-04 11:09:21.152
Upvotes: 6
Views: 9943
Reputation: 3
You can directly cast the field and will it will have the milliesecond.
df = df.withColumn('datetime', col('timestamp').cast(TimestampType()))
Upvotes: 0
Reputation: 51
I think you can use UDF and Python's standard datetime module as below.
import datetime
from pyspark.sql.functions import udf
from pyspark.sql.types import TimestampType
def _to_timestamp(s):
return datetime.datetime.strptime(s, '%Y-%m-%d %H:%M:%S.%f')
udf_to_timestamp = udf(_to_timestamp, TimestampType())
df.select('updated_date').withColumn("updated_date_col2", udf_to_timestamp("updated_date")).show(1,False)
Upvotes: 4
Reputation: 51
Reason pyspark to_timestamp parses only till seconds, while TimestampType have the ability to hold milliseconds.
Following workaround may work:
If the timestamp pattern contains S, Invoke a UDF to get the string 'INTERVAL MILLISECONDS' to use in expression
ts_pattern = "YYYY-MM-dd HH:mm:ss:SSS"
my_col_name = "time_with_ms"
# get the time till seconds
df = df.withColumn(my_col_name, to_timestamp(df["updated_date_col2"],ts_pattern))
# add milliseconds as inteval
if 'S' in timestamp_pattern:
df = df.withColumn(my_col_name, df[my_col_name] + expr("INTERVAL 256 MILLISECONDS"))
To get INTERVAL 256 MILLISECONDS we may use a Java UDF:
df = df.withColumn(col_name, df[col_name] + expr(getIntervalStringUDF(df[my_col_name], ts_pattern)))
Inside UDF: getIntervalStringUDF(String timeString, String pattern)
Upvotes: 1
Reputation: 21
This is not a solution with to_timestamp but you can easily keep your column to time format
Following code is one of example on converting a numerical milliseconds to timestamp.
from datetime import datetime
ms = datetime.now().timestamp() # ex) ms = 1547521021.83301
df = spark.createDataFrame([(1, ms)], ['obs', 'time'])
df = df.withColumn('time', df.time.cast("timestamp"))
df.show(1, False)
+---+--------------------------+
|obs|time |
+---+--------------------------+
|1 |2019-01-15 12:15:49.565263|
+---+--------------------------+
if you use new Date().getTime()
or Date.now()
in JS or datetime.datetime.now().timestamp()
in Python, you can get a numerical milliseconds.
Upvotes: 2