Rasika

Reputation: 507

Convert string (with timestamp) to timestamp in pyspark

I have a dataframe with a string datetime column. When I convert it to a timestamp, the values change. My code is below; can anyone help me convert the column without changing the values?

from pyspark.sql.functions import to_timestamp

df = spark.createDataFrame(
    data=[("1", "2020-04-06 15:06:16 +00:00")],
    schema=["id", "input_timestamp"])
df.printSchema()

# Parse the string column into a TimestampType column
df = df.withColumn("timestamp", to_timestamp("input_timestamp"))
# Cast the TimestampType column back to a string for display
df.withColumn("timestamp_string",
              to_timestamp("timestamp").cast("string")) \
  .show(truncate=False)

This is the output:

+---+--------------------------+-------------------+-------------------+
|id |input_timestamp           |timestamp          |timestamp_string   |
+---+--------------------------+-------------------+-------------------+
|1  |2020-04-06 15:06:16 +00:00|2020-04-06 08:06:16|2020-04-06 08:06:16|
+---+--------------------------+-------------------+-------------------+

Why does the hour change from 15 to 8, and how can I prevent it?

Upvotes: 1

Views: 7651

Answers (2)

notNull

Reputation: 31460

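I believe to_timestamp is converting the value to your local time zone: the +00:00 in your data marks it as UTC, so the parsed timestamp is shown shifted into the session time zone.

  • Try passing an explicit format to to_timestamp() that treats the +00:00 as literal text, so the date and time are read as written.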

Example:

from pyspark.sql.functions import col, to_timestamp

df.withColumn("timestamp", to_timestamp(col("input_timestamp"), "yyyy-MM-dd HH:mm:ss +00:00")).show(10, False)
#+---+--------------------------+-------------------+
#|id |input_timestamp           |timestamp          |
#+---+--------------------------+-------------------+
#|1  |2020-04-06 15:06:16 +00:00|2020-04-06 15:06:16|
#+---+--------------------------+-------------------+
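
To illustrate the explanation above: a timestamp with an offset names one instant, and the hour you see depends on the time zone used to display it. Here is a minimal sketch in plain Python (not Spark), assuming a display time zone 7 hours behind UTC, which is what the 15 to 8 shift in the question suggests. The variable names are mine.

```python
from datetime import datetime, timedelta, timezone

# Parse the string from the question; %z reads the "+00:00" offset.
parsed = datetime.strptime("2020-04-06 15:06:16 +00:00", "%Y-%m-%d %H:%M:%S %z")

# Assumption: the display time zone is 7 hours behind UTC.
assumed_local = timezone(timedelta(hours=-7))

# Same instant, two different wall-clock renderings.
local_view = parsed.astimezone(assumed_local)
utc_view = parsed.astimezone(timezone.utc)

print(local_view.strftime("%Y-%m-%d %H:%M:%S"))  # 2020-04-06 08:06:16
print(utc_view.strftime("%Y-%m-%d %H:%M:%S"))    # 2020-04-06 15:06:16
```

Both values compare equal as instants; only the displayed hour differs.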

Upvotes: 1

过过招

Reputation: 4189

from pyspark.sql.functions import to_utc_timestamp

df = spark.createDataFrame(
    data=[('1', '2020-04-06 15:06:16 +00:00')],
    schema=['id', 'input_timestamp'])
df.printSchema()

df = df.withColumn('timestamp', to_utc_timestamp('input_timestamp', 
    your_local_timezone))
df.withColumn('timestamp_string', df.timestamp.cast('string')).show(truncate=False)

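Replace your_local_timezone with your session time zone, given as a string (a region ID such as 'America/Los_Angeles' is one example of the expected form).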
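
As a rough picture of what this conversion does: the wall-clock value is treated as belonging to the given time zone and then re-expressed in UTC. The sketch below is plain Python, not Spark's implementation, and again assumes a local zone 7 hours behind UTC to match the output in the question.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the local time zone is 7 hours behind UTC.
assumed_local = timezone(timedelta(hours=-7))

# The shifted wall-clock value shown in the question's output (no zone attached).
wall_clock = datetime(2020, 4, 6, 8, 6, 16)

# Interpret it as local time, then express the same instant in UTC.
as_utc = wall_clock.replace(tzinfo=assumed_local).astimezone(timezone.utc)

print(as_utc.strftime("%Y-%m-%d %H:%M:%S"))  # 2020-04-06 15:06:16
```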

Upvotes: 0
