Transform dataset with empty data for dates

Question

I have a dataset with date,accountid and value. I want to transform the dataset to a new dataset where if accountid is not present in a particular date then add a accountid with value of 0 against that date.Is this possible

    val df = sc.parallelize(Seq(("2018-01-01", 100.5,"id1"),
  ("2018-01-02", 120.6,"id1"),
  ("2018-01-03", 450.2,"id2")
  )).toDF("date", "val","accountid")
    +----------+-----+---------+
|      date|  val|accountid|
+----------+-----+---------+
|2018-01-01|100.5|      id1|
|2018-01-02|120.6|      id1|
|2018-01-03|450.2|      id2|
+----------+-----+---------+

I want to transform this dataset into this format

+----------+-----+---------+
|      date|  val|accountid|
+----------+-----+---------+
|2018-01-01|100.5|      id1|
|2018-01-01|  0.0|      id2|
|2018-01-02|120.6|      id1|
|2018-01-02|  0.0|      id2|
|2018-01-03|450.2|      id2|
|2018-01-03|0.0  |      id1|
+----------+-----+---------+

Alper t. Turker · Accepted Answer

You can create reference

import org.apache.spark.sql.functions._
import org.apache.spark.sql.Row

val Row(minTs: Long, maxTs: Long) = df
  .select(to_date($"date").cast("timestamp").cast("bigint") as "date")
  .select(min($"date"), max($"date")).first

val by =  60 * 60 * 24

val ref = spark
  .range(minTs, maxTs + by, by)
  .select($"id".cast("timestamp").cast("date").cast("string").as("date"))
  .crossJoin(df.select("accountid").distinct)

and outer join with input data:

ref.join(df, Seq("date", "accountid"), "leftouter").na.fill(0.0).show
// +----------+---------+-----+      
// |      date|accountid|  val|
// +----------+---------+-----+
// |2018-01-03|      id1|  0.0|
// |2018-01-01|      id1|100.5|
// |2018-01-02|      id2|  0.0|
// |2018-01-02|      id1|120.6|
// |2018-01-03|      id2|450.2|
// |2018-01-01|      id2|  0.0|
// +----------+---------+-----+

Concept adopted from this sparklyr answer by user6910411.

Transform dataset with empty data for dates

Answers (2)

Related Questions