Reputation: 145
I have a Dataframe which has the following structure and data
Source:
Column1(String), Column2(String), Date
-----------------------
1, 2, 01/01/2021
A, B, 02/01/2021
M, N, 05/01/2021
I want to transform it to the following (First 2 columns are replicated in values and date is incremented until a fixed date (until 07/01/2021 in this example) for each of the source row)
To Result:
1, 2, 01/01/2021
1, 2, 02/01/2021
1, 2, 03/01/2021
1, 2, 04/01/2021
1, 2, 05/01/2021
1, 2, 06/01/2021
1, 2, 07/01/2021
A, B, 02/01/2021
A, B, 03/01/2021
A, B, 04/01/2021
A, B, 05/01/2021
A, B, 06/01/2021
A, B, 07/01/2021
M, N, 05/01/2021
M, N, 06/01/2021
M, N, 07/01/2021
Any idea on how this can be achieved in scala spark?
I got this link Replicate Spark Row N-times, but there is no hint on how a particular column can be incremented during replication.
Upvotes: 1
Views: 47
Reputation: 5487
We can use sequence function to generate list of dates in required range, then explode the output array of sequence function to get dataframe in required format.
val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._
spark.sparkContext.setLogLevel("ERROR")
// Sample dataframe
val df = List(("1", "2", "01/01/2021"),
("A", "B", "02/01/2021"),
("M", "N", "05/01/2021"))
.toDF("Column1(String)", "Column2(String)", "Date")
df
.withColumn("Date",explode_outer(sequence(to_date('Date,"dd/MM/yyyy"),
to_date(lit("07/01/2021"),"dd/MM/yyyy"))))
.withColumn("Date",date_format('Date,"dd/MM/yyyy"))
.show(false)
/*
+---------------+---------------+----------+
|Column1(String)|Column2(String)|Date |
+---------------+---------------+----------+
|1 |2 |01/01/2021|
|1 |2 |02/01/2021|
|1 |2 |03/01/2021|
|1 |2 |04/01/2021|
|1 |2 |05/01/2021|
|1 |2 |06/01/2021|
|1 |2 |07/01/2021|
|A |B |02/01/2021|
|A |B |03/01/2021|
|A |B |04/01/2021|
|A |B |05/01/2021|
|A |B |06/01/2021|
|A |B |07/01/2021|
|M |N |05/01/2021|
|M |N |06/01/2021|
|M |N |07/01/2021|
+---------------+---------------+----------+ */
Upvotes: 1