Reputation: 101
I need to complete my dataset with the dates that are missing, with the format: YYYY-MM-DD
In this example, I would like to add a "line" for the dates missing between the dates I have information for, with a value of 0 since I have no data for those dates!
The output would look like this:
Can someone help me ? Thanks!!
Upvotes: 1
Views: 274
Reputation: 10406
It is not that simple to achieve without putting all the data onto one single partition and trashing the performance. What I would do to avoid that is associate each date to an id, than use spark.range
to generate a dataframe all these ids and then join it with the original dataframe. It would go as follows:
import org.apache.spark.sql.Row
// let's create the sample dataframe
val df = Seq("2020-10-01" -> 10, "2020-10-03" -> 10, "2020-10-06" -> 10)
.toDF("Date", "Value")
.withColumn("Date", to_date('Date))
// Then, let's extract the first date and the number of days between the first
// and last dates
val Row(start : Date, diff : Int) = df
.select(min('Date) as "start", max('Date) as "end")
.select('start, datediff('end, 'start) as "diff")
.head
// Finally, we create an id equal to 0 for the first date and diff for the last
// By joining with a dataframe containing all the ids between 0 and diff,
// missing dates will be populated.
df
.withColumn("id", datediff('Date, lit(start)))
.join(spark.range(diff+1), Seq("id"), "right")
.withColumn("start", lit(start))
.select(expr("date_add(start, id)") as "Date", 'Value)
.show
+----------+-----+
| Date|Value|
+----------+-----+
|2020-10-01| 10|
|2020-10-02| null|
|2020-10-03| 10|
|2020-10-04| null|
|2020-10-05| null|
|2020-10-06| 10|
+----------+-----+
Upvotes: 0
Reputation: 640
can you just give an indication about the size of the data that you are working on
Upvotes: 1
Reputation: 22449
One approach would be to assemble a time-series dataframe using LocalDate
functions for the wanted date range and perform a left-join
, as shown below:
import java.time.LocalDate
val startDate: LocalDate = LocalDate.parse("2020-09-30")
val endDate: LocalDate = LocalDate.parse("2020-10-06")
val tsDF = Iterator.iterate(startDate)(_.plusDays(1)).
takeWhile(! _.isAfter(endDate)).
map(java.sql.Date.valueOf(_)).
toSeq.
toDF("date")
val df = Seq(
("2020-10-01", 10),
("2020-10-03", 10),
("2020-10-04", 10),
("2020-10-06", 10)
).toDF("date", "value")
tsDF.
join(df, Seq("date"), "left_outer").
select($"date", coalesce($"value", lit(0)).as("value")).
show
// +----------+-----+
// | date|value|
// +----------+-----+
// |2020-09-30| 0|
// |2020-10-01| 10|
// |2020-10-02| 0|
// |2020-10-03| 10|
// |2020-10-04| 10|
// |2020-10-05| 0|
// |2020-10-06| 10|
// +----------+-----+
Upvotes: 1