Faliha Zikra

Reputation: 421

Find average and total time between dates in a row string in pyspark?

The Spark DataFrame column contains a list of ordered dates delimited by '|'. It looks like this:

+------------------------------------------------------------+
|      date                                                  |
+------------------------------------------------------------+
|2018-08-01|2017-06-01|2015-11-01|2012-08-21|2010-04-02      |
|2016-08-11|2013-08-25|2013-04-01|2012-01-01                 |
+------------------------------------------------------------+

I want to calculate, for each row, the average time between dates and the total difference (last date minus first date) in days, like this:

+------------------------------------------------------------+---------------+---------------+
|      date                                                  | average time  | total time    |
+------------------------------------------------------------+---------------+---------------+
|2018-08-01|2017-06-01|2015-11-01|2012-08-21|2010-08-01      |   584         | 2920          |
|2016-08-11|2013-08-25|2013-04-01|2012-08-11                 |   365         | 1460          |
+------------------------------------------------------------+---------------+---------------+

Upvotes: 2

Views: 1495

Answers (3)

blackbishop

Reputation: 32650

Here is a simplified version of the other answers, using the built-in array functions available in Spark 2.4+.

First, split the dates string to get an array. Since the dates are already sorted, use the element_at function to pick out the first and last dates; a negative index accesses the array from the end. The rest is just computing datediff and dividing it by the size of the array to get the average:

from pyspark.sql.functions import col, datediff, element_at, size, split

# dates are newest-first, so index 1 is the latest date and -1 the earliest
df.withColumn("array_dates", split(col("date"), "[|]"))\
  .withColumn("total_time", datediff(element_at("array_dates", 1), element_at("array_dates", -1))) \
  .withColumn("average_time", col("total_time") / size(col("array_dates"))) \
  .drop("array_dates")\
  .show(2, False)


#+------------------------------------------------------+----------+------------+
#|date                                                  |total_time|average_time|
#+------------------------------------------------------+----------+------------+
#|2018-08-01|2017-06-01|2015-11-01|2012-08-21|2010-04-02|3043      |608.6       |
#|2016-08-11|2013-08-25|2013-04-01|2012-01-01           |1684      |421.0       |
#+------------------------------------------------------+----------+------------+
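As a quick sanity check of the numbers above, the same arithmetic in plain Python (assuming the '|' delimiter and 'yyyy-MM-dd' format from the question, no Spark needed):

```python
from datetime import date

row = "2018-08-01|2017-06-01|2015-11-01|2012-08-21|2010-04-02"
dates = sorted(date.fromisoformat(d) for d in row.split("|"))

total_time = (dates[-1] - dates[0]).days   # last date minus first date
average_time = total_time / len(dates)     # divided by the array size

print(total_time, average_time)  # 3043 608.6
```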

If the dates in the string aren't ordered as in your example, you can sort the array after the split using array_sort.
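array_sort works here because ISO-8601 date strings ('yyyy-MM-dd') compare chronologically under plain lexicographic ordering, which a quick plain-Python check illustrates:

```python
# ISO-8601 date strings sort chronologically as plain strings,
# so sorting the split string array yields earliest-to-latest order.
dates = "2016-08-11|2013-08-25|2013-04-01|2012-01-01".split("|")
print(sorted(dates))  # ['2012-01-01', '2013-04-01', '2013-08-25', '2016-08-11']
```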

Upvotes: 0

murtihash

Reputation: 8410

@Shu gave an awesome answer in Scala.

This is just the PySpark version. Feel free to give him the accepted answer.

from pyspark.sql import functions as F

df.withColumn("date", F.split(F.regexp_replace("date", r"\|", ","), ','))\
  .withColumn("start", F.reverse(F.col("date"))[0])\
  .withColumn("end", F.col("date")[0])\
  .withColumn("total_time", F.datediff("end", "start"))\
  .withColumn("Average", F.col("total_time") / F.size(F.col("date")))\
  .drop("start", "end").show()

+--------------------+----------+-------+
|                date|total_time|Average|
+--------------------+----------+-------+
|[2018-08-01, 2017...|      3043|  608.6|
|[2016-08-11, 2013...|      1684|  421.0|
+--------------------+----------+-------+

Upvotes: 1

notNull

Reputation: 31480

We can use the built-in array functions of DataFrame for this case.

Example:

val df = Seq(("2018-08-01|2017-06-01|2015-11-01|2012-08-21|2010-04-02"),("2016-08-11|2013-08-25|2013-04-01|2012-01-01")).toDF("id") //sample data

df.withColumn("arr_id", array_sort(split(col("id"), "\\|"))). //split to create an array and sort it
withColumn("first_ele", col("arr_id")(0)). //first element of the array
withColumn("last_ele", reverse(col("arr_id"))(0)). //last element of the array
withColumn("total_diff", datediff(col("last_ele").cast("date"), col("first_ele").cast("date"))). //total diff between first and last dates
withColumn("avg_time", col("total_diff") / size(col("arr_id"))). //divide the diff by the size of the array
select("id", "total_diff", "avg_time").
show(false)

Result:

+------------------------------------------------------+----------+--------+
|id                                                    |total_diff|avg_time|
+------------------------------------------------------+----------+--------+
|2018-08-01|2017-06-01|2015-11-01|2012-08-21|2010-04-02|3043      |608.6   |
|2016-08-11|2013-08-25|2013-04-01|2012-01-01           |1684      |421.0   |
+------------------------------------------------------+----------+--------+

I think the output in the question doesn't match the answers because the first and last dates in the question's input and expected output are slightly different!

Upvotes: 2
