Reputation: 421
The spark df column has a list of ordered dates demilited by '|' . It looks like this :
+------------------------------------------------------------+
| date |
+------------------------------------------------------------+
|2018-08-01|2017-06-01|2015-11-01|2012-08-21|2010-04-02 |
|2016-08-11|2013-08-25|2013-04-01|2012-01-01 |
+------------------------------------------------------------+
I want to calculate the average time for each row and the total difference (last date- first date) in days like this :
+------------------------------------------------------------+---------------+---------------+
| date | average time | total time |
+------------------------------------------------------------+---------------+---------------+
|2018-08-01|2017-06-01|2015-11-01|2012-08-21|2010-08-01 | 584 | 2920 |
|2016-08-11|2013-08-25|2013-04-01|2012-08-11 | 365 | 1460 |
+------------------------------------------------------------+---------------+---------------+
Upvotes: 2
Views: 1495
Reputation: 32650
Here is a simplified version of the other answers using array built-in functions in Spark 2.4+.
First, split the dates string to get an array.
Now, as the dates are already sorted, use element_at
function to get first and last dates since when the index is negative it accesses the elements of the array from last to the first. The rest is just calculating datediff
and divide it by the size of the array to get the average:
df.withColumn("array_dates", split(col("date"), "[|]"))\
.withColumn("total_time", datediff(element_at("array_dates", -1), element_at("array_dates", 1))) \
.withColumn("average_time", col("total_time") / size(col("array_dates"))) \
.drop("array_dates")\
.show(2, False)
#+------------------------------------------------------+----------+------------+
#|date |total_time|average_time|
#+------------------------------------------------------+----------+------------+
#|2018-08-01|2017-06-01|2015-11-01|2012-08-21|2010-04-02|3043 |608.6 |
#|2016-08-11|2013-08-25|2013-04-01|2012-01-01 |1684 |421.0 |
#+------------------------------------------------------+----------+------------+
If the dates in the string aren't ordred as in your example, you could sort the array after split using array_sort
Upvotes: 0
Reputation: 8410
@Shu gave an awesome answer in Scala.
This just pyspark version. Feel free to give him the accepted answer.
from pyspark.sql import functions as F
df.withColumn("date", F.split(F.regexp_replace("date","\|",","),','))\
.withColumn("start", F.reverse(F.col("date"))[0]).withColumn("end",F.col("date")[0])\
.withColumn("total_time", F.datediff("end","start"))\
.withColumn("Average", F.col("total_time")/(F.size(F.col("date"))))\
.drop("start","end").show()
+--------------------+----------+-------+
| date|total_time|Average|
+--------------------+----------+-------+
|[2018-08-01, 2017...| 3043| 608.6|
|[2016-08-11, 2013...| 1684| 421.0|
+--------------------+----------+-------+
Upvotes: 1
Reputation: 31480
We can use inbuilt array functions
of dataframe for this case..
Example:
val df= Seq(("2018-08-01|2017-06-01|2015-11-01|2012-08-21|2010-04-02"),("2016-08-11|2013-08-25|2013-04-01|2012-01-01")).toDF("id") //sample data
df.withColumn("arr_id", array_sort(split(col("id"),"\\|"))). //split to create array and sort the array
withColumn("first_ele", col("arr_id")(0)). // to get first element of array
withColumn("last_ele",reverse(col("arr_id"))(0)). //to get last element of array
withColumn("total_diff",datediff(col("last_ele").cast("date"), col("first_ele").cast("date"))). //get total diff of first and last dates
withColumn("avg_time", col("total_diff")/size(col("arr_id"))). //caluculate diff of date and divide with size of array column
select("id", "total_diff","avg_time").
show(false)
Result:
+------------------------------------------------------+----------+--------+
|id |total_diff|avg_time|
+------------------------------------------------------+----------+--------+
|2018-08-01|2017-06-01|2015-11-01|2012-08-21|2010-04-02|3043 |608.6 |
|2016-08-11|2013-08-25|2013-04-01|2012-01-01 |1684 |421.0 |
+------------------------------------------------------+----------+--------+
I think the output in question and answer is off because start and end dates
in array are little different in input
and output
in question!
Upvotes: 2