Reputation: 416
I am trying to convert a column containing date and time as strings to a timestamp; however, I am losing the milliseconds during the conversion.
I have a Spark dataframe df that has a date and a time column, both containing strings. The time string contains milliseconds, as shown below:
+---------+------------+
|date |time |
+---------+------------+
|2018/1/2 |09:53:25.864|
|2018/1/3 |11:32:21.689|
|2018/1/4 |09:34:51.045|
+---------+------------+
I concatenated the date and time columns to get a date_and_time column (string):
import pyspark.sql.functions as F
df = df.withColumn('date_and_time', F.concat_ws(' ', df.date, df.time))
df.show(3, False)
Output:
+--------+------------+---------------------+
|date |time |date_and_time |
+--------+------------+---------------------+
|2018/1/2|09:53:25.864|2018/1/2 09:53:25.864|
|2018/1/3|11:32:21.689|2018/1/3 11:32:21.689|
|2018/1/4|09:34:51.045|2018/1/4 09:34:51.045|
+--------+------------+---------------------+
Then, I specified the timestamp format using SimpleDateFormat date and time patterns:
timestamp_format = 'yyyy/M/d HH:mm:ss.SSS'
Then, I tried to convert this string to timestamp using a couple of different ways:
df.select(
    df.date_and_time,
    F.to_timestamp(df.date_and_time, timestamp_format).alias('method_1'),
    F.unix_timestamp(df.date_and_time, format=timestamp_format).cast('timestamp').alias('method_2')
).show(3, False)
As you can see below, the timestamp is missing the milliseconds part:
+---------------------+-------------------+-------------------+
|date_and_time |method_1 |method_2 |
+---------------------+-------------------+-------------------+
|2018/1/2 09:53:25.864|2018-01-02 09:53:25|2018-01-02 09:53:25|
|2018/1/3 11:32:21.689|2018-01-03 11:32:21|2018-01-03 11:32:21|
|2018/1/4 09:34:51.045|2018-01-04 09:34:51|2018-01-04 09:34:51|
+---------------------+-------------------+-------------------+
How can I preserve the milliseconds when converting the string to timestamp?
I am using PySpark (Spark: 2.3.1, Python: 3.6.5).
I have looked at previously answered questions on SO and have not found a suitable solution.
Upvotes: 9
Views: 1798
Reputation: 1119
Even though this is an old post, I think it may be useful for people. The solution in https://stackoverflow.com/a/54340652/4383754 is probably the best approach, and it should scale well.
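For reference, a native-functions version along those lines (a minimal sketch of the usual workaround, not necessarily identical to the linked answer) parses the whole seconds with unix_timestamp, adds the fractional part back from the string, and casts the sum to timestamp; the column name ts is arbitrary:
import pyspark.sql.functions as F

timestamp_format = 'yyyy/M/d HH:mm:ss.SSS'
df = df.withColumn(
    'ts',
    (F.unix_timestamp('date_and_time', timestamp_format) +                # whole seconds
     F.substring_index('date_and_time', '.', -1).cast('double') / 1000.0  # milliseconds after the dot
    ).cast('timestamp'))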
In case you're looking for a simpler solution and can accept the performance hit of a Python UDF, here is one:
from pyspark.sql.types import TimestampType
from pyspark.sql.functions import udf
from dateutil.parser import parse
data = [('2018/1/2', '09:53:25.864', '2018/1/2 09:53:25.864'),
        ('2018/1/3', '11:32:21.689', '2018/1/3 11:32:21.689'),
        ('2018/1/4', '09:34:51.045', '2018/1/4 09:34:51.045')]
df = spark.createDataFrame(
    data, 'date STRING, time STRING, date_and_time STRING')
parse_udf = udf(parse, TimestampType())
df = df.withColumn('parsed', parse_udf(df['date_and_time']))
df.show()
# +--------+------------+--------------------+--------------------+
# | date| time| date_and_time| parsed|
# +--------+------------+--------------------+--------------------+
# |2018/1/2|09:53:25.864|2018/1/2 09:53:25...|2018-01-02 09:53:...|
# |2018/1/3|11:32:21.689|2018/1/3 11:32:21...|2018-01-03 11:32:...|
# |2018/1/4|09:34:51.045|2018/1/4 09:34:51...|2018-01-04 09:34:...|
# +--------+------------+--------------------+--------------------+
df.dtypes
# [('date', 'string'),
# ('time', 'string'),
# ('date_and_time', 'string'),
# ('parsed', 'timestamp')]
df[['parsed']].collect()[0][0]
# datetime.datetime(2018, 1, 2, 9, 53, 25, 864000) <- contains microseconds
Upvotes: 2