Reputation: 118
I have a DataFrame that contains a person, a weight, and a timestamp, as such:
+-----------+-------------------+------+
| person| timestamp|weight|
+-----------+-------------------+------+
| 1|2019-12-02 14:54:17| 49.94|
| 1|2019-12-03 08:58:39| 50.49|
| 1|2019-12-06 10:44:01| 50.24|
| 2|2019-12-02 08:58:39| 62.32|
| 2|2019-12-04 10:44:01| 65.64|
+-----------+-------------------+------+
I want to fill this in so that every person has a value for every date, meaning that the above should become:
+-----------+-------------------+------+
| person| timestamp|weight|
+-----------+-------------------+------+
| 1|2019-12-02 14:54:17| 49.94|
| 1|2019-12-03 08:58:39| 50.49|
| 1|2019-12-04 00:00:01| 50.49|
| 1|2019-12-05 00:00:01| 50.49|
| 1|2019-12-06 10:44:01| 50.24|
| 1|2019-12-07 00:00:01| 50.24|
| 1|2019-12-08 00:00:01| 50.24|
| 2|2019-12-02 08:58:39| 62.32|
| 2|2019-12-03 00:00:01| 62.32|
| 2|2019-12-04 10:44:01| 65.64|
| 2|2019-12-05 00:00:01| 65.64|
| 2|2019-12-06 00:00:01| 65.64|
| 2|2019-12-07 00:00:01| 65.64|
| 2|2019-12-08 00:00:01| 65.64|
+-----------+-------------------+------+
I have defined a new table that uses datediff to contain all dates between the min and max date:
min_max_date = df_person_weights.select(min("timestamp"), max("timestamp")) \
    .withColumnRenamed("min(timestamp)", "min_date") \
    .withColumnRenamed("max(timestamp)", "max_date")

min_max_date = min_max_date.withColumn("datediff", datediff("max_date", "min_date")) \
    .withColumn("repeat", expr("split(repeat(',', datediff), ',')")) \
    .select("*", posexplode("repeat").alias("date", "val")) \
    .withColumn("date", expr("date_add(min_date, date)"))
This gives me a new DataFrame that contains the dates like:
+----------+
| date|
+----------+
|2019-12-02|
|2019-12-03|
|2019-12-04|
|2019-12-05|
|2019-12-06|
|2019-12-07|
|2019-12-08|
+----------+
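(As a side note, I believe that on Spark 2.4+ the same range of dates could also be built with sequence and explode instead of the repeat/posexplode trick, something like the untested snippet below; but my real problem is the fill/join step that follows.)
from pyspark.sql import functions as F

# Untested alternative, assumes Spark 2.4+ where sequence() is available:
# one row per calendar date between the min and max timestamp.
bounds = df_person_weights.select(
    F.to_date(F.min("timestamp")).alias("min_date"),
    F.to_date(F.max("timestamp")).alias("max_date"))
all_dates = bounds.select(
    F.explode(F.sequence("min_date", "max_date")).alias("date"))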
I have tried different joins like:
min_max_date.join(df_price_history, min_max_date.date != df_price_history.date, "leftouter")
But I'm not getting the results that I need. Can someone help with this? How do I merge the information I have now?
Upvotes: 3
Views: 4997
Reputation: 13459
You’re looking to forward-fill a dataset. This is made a bit more complex because you need to do it per category (person).
One way to do it would be like this: first, create a new DataFrame that has all the dates you want to have a value for, per person (see below, this is just dates_by_person).
Then, left-join the original DataFrame to this one, so you start creating the missing rows.
Next, use a windowing function to find, within each group of person sorted by date, the last non-null weight. In case you can have multiple entries per date (so one person has multiple filled-in records on one specific date), you must also order by the timestamp column.
Finally, coalesce the columns so that any null field gets replaced by the intended value.
from datetime import datetime, timedelta
from itertools import product
import pyspark.sql.functions as psf
from pyspark.sql import Window
data = ( # recreate the DataFrame
(1, datetime(2019, 12, 2, 14, 54, 17), 49.94),
(1, datetime(2019, 12, 3, 8, 58, 39), 50.49),
(1, datetime(2019, 12, 6, 10, 44, 1), 50.24),
(2, datetime(2019, 12, 2, 8, 58, 39), 62.32),
(2, datetime(2019, 12, 4, 10, 44, 1), 65.64))
df = spark.createDataFrame(data, schema=("person", "timestamp", "weight"))
min_max_timestamps = df.agg(psf.min(df.timestamp), psf.max(df.timestamp)).head()
first_date, last_date = [ts.date() for ts in min_max_timestamps]
all_days_in_range = [first_date + timedelta(days=d)
for d in range((last_date - first_date).days + 1)]
people = [row.person for row in df.select("person").distinct().collect()]
dates_by_person = spark.createDataFrame(product(people, all_days_in_range),
schema=("person", "date"))
df2 = (dates_by_person.join(df,
(psf.to_date(df.timestamp) == dates_by_person.date)
& (dates_by_person.person == df.person),
how="left")
.drop(df.person)
)
wind = (Window
.partitionBy("person")
.rangeBetween(Window.unboundedPreceding, -1)
.orderBy(psf.unix_timestamp("date"))
)
df3 = df2.withColumn("last_weight",
psf.last("weight", ignorenulls=True).over(wind))
df4 = df3.select(
df3.person,
psf.coalesce(df3.timestamp, psf.to_timestamp(df3.date)).alias("timestamp"),
psf.coalesce(df3.weight, df3.last_weight).alias("weight"))
df4.show()
# +------+-------------------+------+
# |person| timestamp|weight|
# +------+-------------------+------+
# | 1|2019-12-02 14:54:17| 49.94|
# | 1|2019-12-03 08:58:39| 50.49|
# | 1|2019-12-04 00:00:00| 50.49|
# | 1|2019-12-05 00:00:00| 50.49|
# | 1|2019-12-06 10:44:01| 50.24|
# | 2|2019-12-02 08:58:39| 62.32|
# | 2|2019-12-03 00:00:00| 62.32|
# | 2|2019-12-04 10:44:01| 65.64|
# | 2|2019-12-05 00:00:00| 65.64|
# | 2|2019-12-06 00:00:00| 65.64|
# +------+-------------------+------+
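If you want the filled-in rows to show 00:00:01 rather than midnight, as in the expected output in the question, you could tweak the timestamp coalesce, for example (an untested variation on the df4 step above):
df4 = df3.select(
    df3.person,
    psf.coalesce(df3.timestamp,
                 # filled-in rows: the date at one second past midnight
                 psf.to_timestamp(df3.date) + psf.expr("INTERVAL 1 SECOND")
                 ).alias("timestamp"),
    psf.coalesce(df3.weight, df3.last_weight).alias("weight"))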
Edit: as suggested by David in the comments, if you have a very large number of people, the construction of dates_by_person can be done in a way that doesn’t require bringing everything to the driver. In this example, we’re talking about a small number of integers, nothing big. But if it gets big, try:
dates = spark.createDataFrame(((d,) for d in all_days_in_range),
schema=("date",))
people = df.select("person").distinct()
dates_by_person = dates.crossJoin(people)
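Either way, dates_by_person ends up with one row per (person, date) combination, so the join, window, and coalesce steps above work unchanged. If you also want to avoid building all_days_in_range on the driver, the date range itself could in principle be generated on the cluster as well (e.g. with sequence/explode on Spark 2.4+) before the crossJoin.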
Upvotes: 3