data_person

Reputation: 4480

Timestamp parsing in pyspark

df1:

Timestamp:

1995-08-01T00:00:01.000+0000

Is there a way to extract the day of the month from the timestamp column of the data frame using PySpark? I can't provide any code because I am new to Spark and have no idea how to proceed.

Upvotes: 8

Views: 22003

Answers (3)

HimanshuGahlot

Reputation: 571

I am working with Apache Spark 3.2.1.

I have tried the following:

  • format = "yyyy-MM-dd'T'HH:mm:ss.SSSZ"
  • format = "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"

Neither of these formats worked for me; what worked was the following:

  • format = "yyyy-MM-dd'T'HH:mm:ss.SSSz"

I also noticed a difference when unix_timestamp is used before to_timestamp. Here are the findings:

  • If you use unix_timestamp, your original timestamp will lose everything after the decimal point.
    • For example, 2024-02-11T20:07:28.099+00:00 becomes 2024-02-11T20:07:28.000+00:00.
  • If you skip unix_timestamp and use the following instead, the fractional seconds are retained:

from pyspark.sql import functions as F

df = df.withColumn("ConvertToTimestamp", F.to_timestamp("StringTimestamp", format))
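The loss of the milliseconds is consistent with unix_timestamp returning whole seconds since the epoch, so anything built from that value has no fraction left. The following is a plain-Python sketch of the same round trip, not Spark code; it uses only the standard datetime module, and the variable names are made up for illustration.

```python
from datetime import datetime, timezone

raw = "2024-02-11T20:07:28.099+00:00"

# Direct parse: %f keeps the fractional seconds, %z reads the +00:00 offset.
full = datetime.strptime(raw, "%Y-%m-%dT%H:%M:%S.%f%z")

# Round trip through whole epoch seconds: the fraction is dropped here,
# which mirrors converting to an integer number of seconds first.
epoch_seconds = int(full.timestamp())
truncated = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)

print(full.isoformat(timespec="milliseconds"))
print(truncated.isoformat(timespec="milliseconds"))
```

The first printed value keeps .099, while the second shows .000, matching the behaviour described in the bullets above.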

Upvotes: 0

data_person

Reputation: 4480

Code:

from pyspark.sql.functions import dayofmonth

df1.select(dayofmonth('Timestamp').alias('day'))

Upvotes: 0

Daniel de Paula

Reputation: 17872

You can parse this timestamp using unix_timestamp:

from pyspark.sql import functions as F

format = "yyyy-MM-dd'T'HH:mm:ss.SSSZ"
df2 = df1.withColumn('Timestamp2', F.unix_timestamp('Timestamp', format).cast('timestamp'))

Then, you can use dayofmonth on the new Timestamp2 column:

df2.select(F.dayofmonth('Timestamp2'))

More details about these functions can be found in the pyspark functions documentation.
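One thing to watch with this particular sample: the value is one second past midnight at offset +0000, and Spark evaluates dayofmonth on a timestamp in the session time zone, so the day can come out as 31 instead of 1 if the session is not on UTC. The following is a plain-Python sketch of that effect, not Spark code; the fixed UTC-5 offset is an arbitrary choice for illustration.

```python
from datetime import datetime, timedelta, timezone

raw = "1995-08-01T00:00:01.000+0000"

# %f consumes the fractional seconds, %z the +0000 offset.
parsed = datetime.strptime(raw, "%Y-%m-%dT%H:%M:%S.%f%z")

# Day of month when the instant is viewed in UTC.
day_utc = parsed.astimezone(timezone.utc).day

# Day of month when the same instant is viewed five hours behind UTC
# (hypothetical offset, picked only to show the date rolling back).
day_minus5 = parsed.astimezone(timezone(timedelta(hours=-5))).day

print(day_utc, day_minus5)
```

If the result from Spark looks off by one day, checking the spark.sql.session.timeZone setting is a reasonable first step.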

Upvotes: 14
