CodeGeek123

Reputation: 4501

Working with Microsecond Time Stamps in PySpark

I have a PySpark dataframe with timestamps stored as strings in the following format: 20190111-08:15:45.275753. I want to convert this to a timestamp type while keeping the microsecond granularity. However, it seems difficult to keep the microseconds, as all the time conversions in PySpark appear to produce seconds only.

Do you have any idea how this can be done? Note that converting it to pandas etc. will not work, as the dataset is huge, so I need an efficient way of doing this. An example of what I am currently doing is below:

from pyspark.sql.functions import col, unix_timestamp

time_df = spark.createDataFrame([('20150408-01:12:04.275753',)], ['dt'])
res = time_df.withColumn("time", unix_timestamp(col("dt"), format='yyyyMMdd-HH:mm:ss.000').alias("time"))
res.show(5, False)

Upvotes: 5

Views: 7073

Answers (2)

roschach

Reputation: 9396

Normally Unix timestamp granularity is in seconds, so I do not think there is a direct method to keep microsecond granularity.

In PySpark there is the function unix_timestamp:

unix_timestamp(timestamp=None, format='yyyy-MM-dd HH:mm:ss')

Convert time string with given pattern ('yyyy-MM-dd HH:mm:ss', by default) to Unix time stamp (in seconds), using the default timezone and the default locale, return null if fail.

if `timestamp` is None, then it returns current timestamp.

>>> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
>>> time_df = spark.createDataFrame([('2015-04-08',)], ['dt'])
>>> time_df.select(unix_timestamp('dt', 'yyyy-MM-dd').alias('unix_time')).collect()
[Row(unix_time=1428476400)]
>>> spark.conf.unset("spark.sql.session.timeZone")

A usage example:

import pyspark.sql.functions as F

res = df.withColumn(colName, F.unix_timestamp(F.col(colName), format='yyyy-MM-dd HH:mm:ss.000').alias(colName))
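
As a quick sanity check (a minimal sketch, assuming a toy one-row dataframe), the value returned by unix_timestamp is a long, so the fractional part of the input is dropped:

import pyspark.sql.functions as F

check_df = spark.createDataFrame([('2015-04-08 01:12:04',)], ['dt'])
check_df.select(
    F.unix_timestamp('dt', 'yyyy-MM-dd HH:mm:ss').alias('unix_time')
).printSchema()
# root
#  |-- unix_time: long (nullable = true)   <- whole seconds only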

What you might do is split your date string (str.rsplit('.', 1)), keeping the microseconds apart (for example in another column of your dataframe).

EDIT

In your example the problem is that the time is of type string. First you need to convert it to a timestamp type, which can be done with:

res = time_df.withColumn("new_col", F.to_timestamp("dt", "yyyyMMdd-HH:mm:ss"))

Then you can use unix_timestamp:

res2 = res.withColumn("time", F.unix_timestamp(F.col("new_col"), format='yyyyMMdd-HH:mm:ss.000').alias("time"))

Finally, to create a column with the microseconds:

res3 = res2.withColumn("ms", F.split(res2['dt'], '[.]').getItem(1))
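
Putting these steps together, a minimal end-to-end sketch (the parsed, secs, micros and epoch_micro column names are just illustrative, and on Spark 3+ the pattern may need spark.sql.legacy.timeParserPolicy=LEGACY to tolerate the trailing fraction):

import pyspark.sql.functions as F

time_df = spark.createDataFrame([('20150408-01:12:04.275753',)], ['dt'])

res = (time_df
       .withColumn("parsed", F.to_timestamp("dt", "yyyyMMdd-HH:mm:ss"))  # string -> timestamp
       .withColumn("secs", F.unix_timestamp("parsed"))                   # whole seconds since epoch
       .withColumn("micros", F.split("dt", "[.]").getItem(1)))           # '275753' kept as a string

# Recombine into a single fractional epoch value if that is what you need
res = res.withColumn("epoch_micro", F.col("secs") + F.col("micros").cast("double") / 1e6)
res.show(truncate=False)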

Upvotes: 3

CodeGeek123

Reputation: 4501

I've found a workaround for this using the to_utc_timestamp function in PySpark. I'm not entirely sure whether this is the most efficient approach, but it seems to work fine on about 100 million rows of data. You can avoid the regexp_replace if your timestamp string already looks like this: 1997-02-28 10:30:40.897748.

from pyspark.sql.functions import regexp_replace, to_utc_timestamp

df = spark.createDataFrame([('19970228-10:30:40.897748',)], ['new_t'])
# Rewrite '19970228-10:30:40.897748' into '1997-02-28 10:30:40.897748'
df = df.withColumn('t', regexp_replace('new_t', '^(.{4})(.{2})(.{2})-', '$1-$2-$3 '))
# Parse the reformatted string into a timestamp, keeping the microseconds
df = df.withColumn("time", to_utc_timestamp(df.t, "UTC").alias('t'))
df.show(5, False)
print(df.dtypes)
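
If you also need the value as a number while keeping the sub-second part, one option (not part of the workaround above, just a common trick) is to cast the timestamp column to double, which yields epoch seconds with the microseconds as the fractional part:

from pyspark.sql.functions import col

# Epoch seconds as a double, with the microseconds kept as the fractional part
df = df.withColumn("epoch_micro", col("time").cast("double"))
df.select("time", "epoch_micro").show(5, False)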

Upvotes: 2
