emax
emax

Reputation: 7255

Pyspark: how to extract hour from timestamp

I have a table like the following

    df

 +------------------------------------+-----------------------+
|identifier                          |timestamp              |
+------------------------------------+-----------------------+
|86311425-0890-40a5-8950-54cbaaa60815|2020-03-18 14:41:55 UTC|
|38e121a8-f21f-4d10-bb69-26eb045175b5|2020-03-13 15:19:21 UTC|
|1a69c9b0-283b-4b6d-89ac-66f987280c66|2020-03-16 12:59:51 UTC|
|c7b5c53f-bf40-498f-8302-4b3329322bc9|2020-03-18 22:05:06 UTC|
|0d3d807b-9b3a-466e-907c-c22402240730|2020-03-17 18:40:03 UTC|
+------------------------------------+-----------------------+

tmp.printSchema()
root
 |-- identifier: string (nullable = true)
 |-- timestamp: string (nullable = true)

I would like to have a column that take only the day and the hours from the timestamp.

I am trying the following:

from pyspark.sql.functions import hour
df = df.withColumn("hour", hour(col("timestamp")))

but I get the following

+--------------------+--------------------+----+
|          identifier|           timestamp|hour|
+--------------------+--------------------+----+
|321869c3-71e5-41d...|2020-03-19 03:34:...|null|
|226b8d50-2c6a-471...|2020-03-19 02:59:...|null|
|47818b7c-34b5-43c...|2020-03-19 01:41:...|null|
|f5ca5599-7252-49d...|2020-03-19 04:25:...|null|
|add2ae24-aa7b-4d3...|2020-03-19 01:50:...|null|
+--------------------+--------------------+----+

while I would like to have

+--------------------+--------------------+-------------------+
|          identifier|           timestamp|hour               |
+--------------------+--------------------+-------------------+
|321869c3-71e5-41d...|2020-03-19 03:00:...|2020-03-19 03:00:00|
|226b8d50-2c6a-471...|2020-03-19 02:59:...|2020-03-19 02:00:00|
|47818b7c-34b5-43c...|2020-03-19 01:41:...|2020-03-19 01:00:00|
|f5ca5599-7252-49d...|2020-03-19 04:25:...|2020-03-19 04:00:00|
|add2ae24-aa7b-4d3...|2020-03-19 01:50:...|2020-03-19 01:00:00|
+--------------------+--------------------+-------------------+

Upvotes: 17

Views: 42294

Answers (5)

Curious Watcher
Curious Watcher

Reputation: 689

For Spark 3.3.0, a simple hour and weekofyear does the magic. Assumption: timestamp is already in correct format

from pyspark.sql import functions as SF

(
  df
  .withColumn("hour"      , SF.hour("timestamp") )
  .withColumn("weekofyear", SF.weekofyear("timestamp") )
  .show(n=2)
)

Upvotes: 2

Duong Vu
Duong Vu

Reputation: 197

You asked to get both date and hour, you can use the function provided by pyspark to extract only the date and hour like below:

3 steps:

  • Transform the timestamp column to timestamp format
  • Use date function to extract the date from the timestamp format
  • Use hour function to extract the hour from the timestamp format

The code would look like this:

from pyspark.sql.functions import *
# Step 1: transform to the correct col format
df = df.withColumn("timestamp", to_timestamp("timestamp", 'yyyy-MM-dd HH:mm:ss'))

# Step 2 & 3: Extract the needed information
df = df.withColumn('Date', date(df.timestamp))
df = df.withColumn('Hour', hour(df.timestamp))

# Display the result
df.show(1, False)
#+----------+--------------------+-------------------+-------------------+
#|identifier|           timestamp|               Date|               Hour|
#+----------+--------------------+-------------------+-------------------+
#|         1|2020-03-19 03:00:...|                 19|                 03|
#+----------+--------------------+-------------------+-------------------+

The hour col does not look exactly like what you describe since it's already been answered by notNull above. This is another way if you only want to get the number of date and hour for later groupby or aggregation for example.

Upvotes: 6

SamuelNLP
SamuelNLP

Reputation: 4136

why not just a custom udf?

import pyspark.sql.functions as F
from pyspark.sql.types import IntegerType

hour = F.udf(lambda x: x.hour, IntegerType())
hours = df.withColumn("hour", hour("datetime"))

hours.limit(5).toPandas()

will give you this:

enter image description here

Upvotes: 3

notNull
notNull

Reputation: 31520

Use from_unixtime and unix_timestampfunctions as hour is used to extract hour value from timestamp (or) string(yyyy-MM-dd HH:mm:ss) type

from pyspark.sql.functions import *
#sample data
df.show(truncate=False)
#+----------+-----------------------+
#|identifier|timestamp              |
#+----------+-----------------------+
#|1         |2020-03-18 14:41:55 UTC|
#+----------+-----------------------+
#DataFrame[identifier: string, timestamp: string]

df.withColumn("hour", from_unixtime(unix_timestamp(col("timestamp"),"yyyy-MM-dd hh:mm:ss"),"yyyy-MM-dd hh:00:00")).show()
#+----------+--------------------+-------------------+
#|identifier|           timestamp|               hour|
#+----------+--------------------+-------------------+
#|         1|2020-03-18 14:41:...|2020-03-18 14:00:00|
#+----------+--------------------+-------------------+

Usage of hour function:

#on string type 
spark.sql("select hour('2020-03-04 12:34:34')").show()
#on timestamp type
spark.sql("select hour(timestamp('2020-03-04 12:34:34'))").show()
#+---+
#|_c0|
#+---+
#| 12|
#+---+

Upvotes: 3

murtihash
murtihash

Reputation: 8410

You should use pyspark inbuilt function date_trunc to truncate to hour. You can also truncate to day/month/year etc.

from pyspark.sql import functions as F
df.withColumn("hour", F.date_trunc('hour',F.to_timestamp("timestamp","yyyy-MM-dd HH:mm:ss 'UTC'")))\
  .show(truncate=False)


+------------------------------------+-----------------------+-------------------+
|identifier                          |timestamp              |hour               |
+------------------------------------+-----------------------+-------------------+
|86311425-0890-40a5-8950-54cbaaa60815|2020-03-18 14:41:55 UTC|2020-03-18 14:00:00|
|38e121a8-f21f-4d10-bb69-26eb045175b5|2020-03-13 15:19:21 UTC|2020-03-13 15:00:00|
|1a69c9b0-283b-4b6d-89ac-66f987280c66|2020-03-16 12:59:51 UTC|2020-03-16 12:00:00|
|c7b5c53f-bf40-498f-8302-4b3329322bc9|2020-03-18 22:05:06 UTC|2020-03-18 22:00:00|
|0d3d807b-9b3a-466e-907c-c22402240730|2020-03-17 18:40:03 UTC|2020-03-17 18:00:00|
+------------------------------------+-----------------------+-------------------+

Upvotes: 15

Related Questions