Agustina
Agustina

Reputation: 101

Populate dataset with missing dates (in days) with scala

I need to complete my dataset with the dates that are missing, with the format: YYYY-MM-DD

In this example, I would like to add a "line" for the dates missing between the dates I have information for, with a value of 0 since I have no data for those dates!

enter image description here

The output would look like this:

enter image description here

Can someone help me ? Thanks!!

Upvotes: 1

Views: 274

Answers (3)

Oli
Oli

Reputation: 10406

It is not that simple to achieve without putting all the data onto one single partition and trashing the performance. What I would do to avoid that is associate each date to an id, than use spark.range to generate a dataframe all these ids and then join it with the original dataframe. It would go as follows:

import org.apache.spark.sql.Row

// let's create the sample dataframe
val df = Seq("2020-10-01" -> 10, "2020-10-03" -> 10, "2020-10-06" -> 10)
     .toDF("Date", "Value")
     .withColumn("Date", to_date('Date))

// Then, let's extract the first date and the number of days between the first
// and last dates
val Row(start : Date, diff : Int) = df
    .select(min('Date) as "start", max('Date) as "end")
    .select('start, datediff('end, 'start) as "diff")
    .head

// Finally, we create an id equal to 0 for the first date and diff for the last
// By joining with a dataframe containing all the ids between 0 and diff,
// missing dates will be populated.
df
    .withColumn("id", datediff('Date, lit(start)))
    .join(spark.range(diff+1), Seq("id"), "right")
    .withColumn("start", lit(start))
    .select(expr("date_add(start, id)") as "Date", 'Value)
    .show
+----------+-----+
|      Date|Value|
+----------+-----+
|2020-10-01|   10|
|2020-10-02| null|
|2020-10-03|   10|
|2020-10-04| null|
|2020-10-05| null|
|2020-10-06|   10|
+----------+-----+

Upvotes: 0

Nassereddine BELGHITH
Nassereddine BELGHITH

Reputation: 640

can you just give an indication about the size of the data that you are working on

Upvotes: 1

Leo C
Leo C

Reputation: 22449

One approach would be to assemble a time-series dataframe using LocalDate functions for the wanted date range and perform a left-join, as shown below:

import java.time.LocalDate

val startDate: LocalDate = LocalDate.parse("2020-09-30")
val endDate: LocalDate = LocalDate.parse("2020-10-06")

val tsDF = Iterator.iterate(startDate)(_.plusDays(1)).
  takeWhile(! _.isAfter(endDate)).
  map(java.sql.Date.valueOf(_)).
  toSeq.
  toDF("date")

val df = Seq(
  ("2020-10-01", 10),
  ("2020-10-03", 10),
  ("2020-10-04", 10),
  ("2020-10-06", 10)
).toDF("date", "value")

tsDF.
  join(df, Seq("date"), "left_outer").
  select($"date", coalesce($"value", lit(0)).as("value")).
  show
// +----------+-----+
// |      date|value|
// +----------+-----+
// |2020-09-30|    0|
// |2020-10-01|   10|
// |2020-10-02|    0|
// |2020-10-03|   10|
// |2020-10-04|   10|
// |2020-10-05|    0|
// |2020-10-06|   10|
// +----------+-----+

Upvotes: 1

Related Questions