Reputation: 7255
I have a table like the following:
df
+------------------------------------+-----------------------+
|identifier |timestamp |
+------------------------------------+-----------------------+
|86311425-0890-40a5-8950-54cbaaa60815|2020-03-18 14:41:55 UTC|
|38e121a8-f21f-4d10-bb69-26eb045175b5|2020-03-13 15:19:21 UTC|
|1a69c9b0-283b-4b6d-89ac-66f987280c66|2020-03-16 12:59:51 UTC|
|c7b5c53f-bf40-498f-8302-4b3329322bc9|2020-03-18 22:05:06 UTC|
|0d3d807b-9b3a-466e-907c-c22402240730|2020-03-17 18:40:03 UTC|
+------------------------------------+-----------------------+
df.printSchema()
root
|-- identifier: string (nullable = true)
|-- timestamp: string (nullable = true)
I would like to have a column that takes only the day and the hour from the timestamp.
I am trying the following:
from pyspark.sql.functions import col, hour
df = df.withColumn("hour", hour(col("timestamp")))
but I get the following
+--------------------+--------------------+----+
| identifier| timestamp|hour|
+--------------------+--------------------+----+
|321869c3-71e5-41d...|2020-03-19 03:34:...|null|
|226b8d50-2c6a-471...|2020-03-19 02:59:...|null|
|47818b7c-34b5-43c...|2020-03-19 01:41:...|null|
|f5ca5599-7252-49d...|2020-03-19 04:25:...|null|
|add2ae24-aa7b-4d3...|2020-03-19 01:50:...|null|
+--------------------+--------------------+----+
while I would like to have
+--------------------+--------------------+-------------------+
| identifier| timestamp|hour |
+--------------------+--------------------+-------------------+
|321869c3-71e5-41d...|2020-03-19 03:34:...|2020-03-19 03:00:00|
|226b8d50-2c6a-471...|2020-03-19 02:59:...|2020-03-19 02:00:00|
|47818b7c-34b5-43c...|2020-03-19 01:41:...|2020-03-19 01:00:00|
|f5ca5599-7252-49d...|2020-03-19 04:25:...|2020-03-19 04:00:00|
|add2ae24-aa7b-4d3...|2020-03-19 01:50:...|2020-03-19 01:00:00|
+--------------------+--------------------+-------------------+
Upvotes: 17
Views: 42294
Reputation: 689
For Spark 3.3.0, the simple hour and weekofyear functions do the magic. Assumption: the timestamp column is already of timestamp type (i.e., in the correct format).
from pyspark.sql import functions as SF
(
    df
    .withColumn("hour", SF.hour("timestamp"))
    .withColumn("weekofyear", SF.weekofyear("timestamp"))
    .show(n=2)
)
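If timestamp is still a string with a trailing UTC, as in the question, a minimal sketch of the conversion first (the format string is an assumption based on the question's sample data):
df = df.withColumn("ts", SF.to_timestamp("timestamp", "yyyy-MM-dd HH:mm:ss 'UTC'"))
df.withColumn("hour", SF.hour("ts")).withColumn("weekofyear", SF.weekofyear("ts")).show(n=2)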
Upvotes: 2
Reputation: 197
You asked to get both the date and the hour; you can use the functions provided by PySpark to extract them in three steps: convert the column to timestamp type, then extract the day, then extract the hour. The code would look like this:
from pyspark.sql.functions import *
# Step 1: transform to a proper timestamp column (the sample data carries a trailing ' UTC')
df = df.withColumn("timestamp", to_timestamp("timestamp", "yyyy-MM-dd HH:mm:ss 'UTC'"))
# Step 2 & 3: Extract the needed information
df = df.withColumn('Date', dayofmonth(df.timestamp))
df = df.withColumn('Hour', hour(df.timestamp))
# Display the result
df.show(1, False)
#+----------+--------------------+----+----+
#|identifier|timestamp           |Date|Hour|
#+----------+--------------------+----+----+
#|1         |2020-03-19 03:00:...|19  |3   |
#+----------+--------------------+----+----+
The Hour column does not look exactly like what you describe, since that form has already been covered by notNull's answer above. This is an alternative if you only want the day and hour numbers for a later groupBy or aggregation, for example.
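For instance, a minimal sketch of such an aggregation, counting rows per day and hour (using the Date and Hour columns produced above):
df.groupBy('Date', 'Hour').count().orderBy('Date', 'Hour').show()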
Upvotes: 6
Reputation: 4136
Why not just a custom UDF?
import pyspark.sql.functions as F
from pyspark.sql.types import IntegerType

# assumes the column (here named "datetime") holds timestamp-typed values, so each x is a datetime
hour = F.udf(lambda x: x.hour, IntegerType())
hours = df.withColumn("hour", hour("datetime"))
hours.limit(5).toPandas()
will give you the hour of each row as an integer column.
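If the column is still a string, as in the question, a minimal sketch that parses it inside the UDF instead (the format string, including the trailing UTC, is an assumption based on the question's sample data):
from datetime import datetime
# parse the string and pull out the hour; returns None for null inputs
parse_hour = F.udf(lambda s: datetime.strptime(s, "%Y-%m-%d %H:%M:%S UTC").hour if s else None, IntegerType())
df.withColumn("hour", parse_hour("timestamp")).show(5)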
Upvotes: 3
Reputation: 31520
Use the from_unixtime and unix_timestamp functions, as hour only extracts the hour value from a timestamp (or a string already in yyyy-MM-dd HH:mm:ss format).
from pyspark.sql.functions import *
#sample data
df.show(truncate=False)
#+----------+-----------------------+
#|identifier|timestamp |
#+----------+-----------------------+
#|1 |2020-03-18 14:41:55 UTC|
#+----------+-----------------------+
#DataFrame[identifier: string, timestamp: string]
df.withColumn("hour", from_unixtime(unix_timestamp(col("timestamp"),"yyyy-MM-dd hh:mm:ss"),"yyyy-MM-dd hh:00:00")).show()
#+----------+--------------------+-------------------+
#|identifier| timestamp| hour|
#+----------+--------------------+-------------------+
#| 1|2020-03-18 14:41:...|2020-03-18 14:00:00|
#+----------+--------------------+-------------------+
Usage of the hour function:
#on string type
spark.sql("select hour('2020-03-04 12:34:34')").show()
#on timestamp type
spark.sql("select hour(timestamp('2020-03-04 12:34:34'))").show()
#+---+
#|_c0|
#+---+
#| 12|
#+---+
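The same check via the DataFrame API, as a minimal sketch (lit just wraps the literal string, which Spark implicitly casts to a timestamp for hour):
from pyspark.sql.functions import hour, lit
spark.range(1).select(hour(lit('2020-03-04 12:34:34'))).show()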
Upvotes: 3
Reputation: 8410
You should use the PySpark built-in function date_trunc to truncate to the hour. You can also truncate to day/month/year, etc., as the sketch after the output below shows.
from pyspark.sql import functions as F
df.withColumn("hour", F.date_trunc('hour',F.to_timestamp("timestamp","yyyy-MM-dd HH:mm:ss 'UTC'")))\
.show(truncate=False)
+------------------------------------+-----------------------+-------------------+
|identifier |timestamp |hour |
+------------------------------------+-----------------------+-------------------+
|86311425-0890-40a5-8950-54cbaaa60815|2020-03-18 14:41:55 UTC|2020-03-18 14:00:00|
|38e121a8-f21f-4d10-bb69-26eb045175b5|2020-03-13 15:19:21 UTC|2020-03-13 15:00:00|
|1a69c9b0-283b-4b6d-89ac-66f987280c66|2020-03-16 12:59:51 UTC|2020-03-16 12:00:00|
|c7b5c53f-bf40-498f-8302-4b3329322bc9|2020-03-18 22:05:06 UTC|2020-03-18 22:00:00|
|0d3d807b-9b3a-466e-907c-c22402240730|2020-03-17 18:40:03 UTC|2020-03-17 18:00:00|
+------------------------------------+-----------------------+-------------------+
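Since date_trunc accepts coarser units as well, a minimal sketch truncating the same parsed timestamp to the day (same assumptions about the input format as above):
df.withColumn("day", F.date_trunc('day', F.to_timestamp("timestamp", "yyyy-MM-dd HH:mm:ss 'UTC'")))\
    .show(truncate=False)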
Upvotes: 15