Reputation: 97
I have a df with product data with the following schema
root
|-- Creator: string (nullable = true)
|-- Created_datetime: timestamp (nullable = true)
|-- Last_modified_datetime: timestamp (nullable = true)
|-- Product_name: string (nullable = true)
The column Created_datetime looks like this:
+-------------------+
| Created_datetime|
+-------------------+
|2019-10-12 17:09:18|
|2019-12-03 07:02:07|
|2020-01-16 23:10:08|
+-------------------+
Now I would like to extract the average value of the Created_datetime
column (or, failing that, the existing value closest to that average). How can this be achieved?
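For reference, a minimal sketch that rebuilds a DataFrame matching the sample above, so the snippets below can be tried (this assumes a local SparkSession named spark and only populates the Created_datetime column):
from pyspark.sql import SparkSession
from pyspark.sql.functions import to_timestamp

spark = SparkSession.builder.getOrCreate()

# Hypothetical test data based on the three sample rows shown in the question.
df = spark.createDataFrame(
    [("2019-10-12 17:09:18",), ("2019-12-03 07:02:07",), ("2020-01-16 23:10:08",)],
    ["Created_datetime"],
).withColumn("Created_datetime", to_timestamp("Created_datetime"))

df.printSchema()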
Upvotes: 0
Views: 1783
Reputation: 21
In this case I would first pin every row to the same date with date_format (so only the time of day varies), cast the result to a unix timestamp, apply the avg aggregate, and then format the average back to HH:mm:ss with from_unixtime, like the snippet below:
from pyspark.sql.functions import avg, date_format, unix_timestamp, from_unixtime

# date_format pins every row to the same literal date, so only the time of day is averaged
df.withColumn(
    "Created_datetime",
    unix_timestamp(date_format("Created_datetime", "2019-01-01 HH:mm:ss")),
).agg(
    from_unixtime(avg("Created_datetime"), "HH:mm:ss").alias("Created_datetime_average")
).show()
Upvotes: 0
Reputation: 3419
When you calculate the average of a timestamp column, it gives you the average unix timestamp (long) value. Cast it back to a timestamp:
from pyspark.sql import functions as F

df.agg(F.avg("Created_datetime").cast("timestamp").alias("avg_created_datetime")).show()
+--------------------+
|avg_created_datetime|
+--------------------+
| 2019-11-30 23:27:11|
+--------------------+
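The question also asks for the existing value closest to the average. A sketch of one way to get it (not part of the original answer; it recomputes the average via a cast to long, in seconds since the epoch, and orders by the absolute difference):
from pyspark.sql import functions as F

# Average Created_datetime as seconds since the epoch.
avg_ts = df.agg(F.avg(F.col("Created_datetime").cast("long")).alias("avg_ts")).first()["avg_ts"]

# Existing row whose Created_datetime is closest to that average.
closest = (
    df.withColumn("diff_seconds", F.abs(F.col("Created_datetime").cast("long") - F.lit(avg_ts)))
      .orderBy("diff_seconds")
      .limit(1)
)
closest.show()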
Upvotes: 3