Subu Ganesh
Subu Ganesh

Reputation: 985

Convert a string to a timestamp object in Pyspark

I am trying to convert a string to a timestamp format in Pyspark.

from pyspark.sql.types import DateType

df = spark.createDataFrame([('28/Mar/2021:06:29:54 -0700',)], ['dt'])
df.select(date_format('dt', 'd/M/y:h:m:s Z').alias('date')).collect()

This does not seem to work, probably because the date_format function does not recognize this as a valid format. I get this :

[Row(date=None)]

Is there anyway I can make the pyspark functions understand the format doing something similar to the datetime module in Python ?

from datetime import datetime
datetime.strptime('28/Mar/2021:06:29:54 -0700', '%d/%b/%Y:%H:%M:%S %z')

When we pass the format it creates a valid datetime object

datetime.datetime(2021, 3, 28, 6, 29, 54, tzinfo=datetime.timezone(datetime.timedelta(days=-1, seconds=61200)))

Upvotes: 1

Views: 1685

Answers (1)

mck
mck

Reputation: 42422

Unlike the Python datetime module, in Spark, you need to specify the number of characters for each pattern. Also, use to_timestamp to convert a string to a timestamp type. date_format is for the other way round, i.e. converting timestamp types to a string.

from pyspark.sql.functions import to_timestamp

df = spark.createDataFrame([('28/Mar/2021:06:29:54 -0700',)], ['dt'])
df.select(to_timestamp('dt', 'dd/MMM/yyyy:HH:mm:ss Z').alias('date')).collect()
# gives [Row(date=datetime.datetime(2021, 3, 28, 14, 29, 54))]

Upvotes: 1

Related Questions