Partial Replication of DataFrame rows

Question

I have a Dataframe which has the following structure and data

Source:

Column1(String), Column2(String), Date
-----------------------
1, 2, 01/01/2021
A, B, 02/01/2021 
M, N, 05/01/2021

I want to transform it to the following (First 2 columns are replicated in values and date is incremented until a fixed date (until 07/01/2021 in this example) for each of the source row)

To Result:

1, 2, 01/01/2021
1, 2, 02/01/2021
1, 2, 03/01/2021
1, 2, 04/01/2021
1, 2, 05/01/2021
1, 2, 06/01/2021
1, 2, 07/01/2021
A, B, 02/01/2021 
A, B, 03/01/2021 
A, B, 04/01/2021 
A, B, 05/01/2021 
A, B, 06/01/2021 
A, B, 07/01/2021 
M, N, 05/01/2021
M, N, 06/01/2021
M, N, 07/01/2021

Any idea on how this can be achieved in scala spark?

I got this link Replicate Spark Row N-times, but there is no hint on how a particular column can be incremented during replication.

Mohana B C · Accepted Answer

We can use sequence function to generate list of dates in required range, then explode the output array of sequence function to get dataframe in required format.

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._
spark.sparkContext.setLogLevel("ERROR")

// Sample dataframe
val df = List(("1", "2", "01/01/2021"),
              ("A", "B", "02/01/2021"),
              ("M", "N", "05/01/2021"))
            .toDF("Column1(String)", "Column2(String)", "Date")

df
  .withColumn("Date",explode_outer(sequence(to_date('Date,"dd/MM/yyyy"),
                                    to_date(lit("07/01/2021"),"dd/MM/yyyy"))))
  .withColumn("Date",date_format('Date,"dd/MM/yyyy"))
  .show(false)

/*
+---------------+---------------+----------+
|Column1(String)|Column2(String)|Date      |
+---------------+---------------+----------+
|1              |2              |01/01/2021|
|1              |2              |02/01/2021|
|1              |2              |03/01/2021|
|1              |2              |04/01/2021|
|1              |2              |05/01/2021|
|1              |2              |06/01/2021|
|1              |2              |07/01/2021|
|A              |B              |02/01/2021|
|A              |B              |03/01/2021|
|A              |B              |04/01/2021|
|A              |B              |05/01/2021|
|A              |B              |06/01/2021|
|A              |B              |07/01/2021|
|M              |N              |05/01/2021|
|M              |N              |06/01/2021|
|M              |N              |07/01/2021|
+---------------+---------------+----------+ */

Partial Replication of DataFrame rows

Answers (1)

Related Questions