Reputation: 35
I have a peculiar requirement to denormalize data as below:
Source dataframe: Key, item_desc
Target dataframe: Key, item_desc1, item_desc2, item_desc3, item_desc4
For every 4 records in the source dataframe I should create one record in the target dataframe.
source data:
Key, item_desc
1, desc1
1, desc2
1, desc3
1, desc4
1, desc5
1, desc6
target data:
key, item_desc1, item_desc2, item_desc3, item_desc4
1, desc1, desc2, desc3, desc4
1, desc5, desc6
Can anyone guide me on how to write this code? I have written sample code that does this in plain Scala, like below:
val l = (1 to 102).toList
var n = ""
var j = 1
for (i <- l) {
  n = n + i + ","
  if (j % 4 == 0) {        // a full group of 4: print it and start a new one
    println(n)
    n = ""
  }
  if (j == l.size && n.nonEmpty) println(n)  // print the trailing, partial group
  j = j + 1
}
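(The same chunking can also be written more compactly with grouped on a plain Scala collection; shown only to make the intent of the loop clear:)
// equivalent chunking with grouped, just to illustrate the intent
(1 to 102).toList
  .grouped(4)
  .foreach(g => println(g.mkString(",")))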
But I need to apply this logic to a DataFrame/RDD/List in Spark. Please help me with this!
Upvotes: 3
Views: 2078
Reputation: 35229
You could try something like this:
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._
import spark.implicits._  // for toDF and $; spark is the SparkSession, already in scope in spark-shell

val w = Window.partitionBy("key").orderBy("item_desc")

val df = Seq(
  (1, "desc1"), (1, "desc2"), (1, "desc3"),
  (1, "desc4"), (1, "desc5"), (1, "desc6")
).toDF("key", "item_desc")
df
  // Add sequential id within each key: 0 .. n - 1
  .withColumn("id", row_number.over(w) - 1)
  // Add row group id (one group per 4 records)
  .withColumn("group_id", floor($"id" / 4))
  // Add column group id (item_desc0 .. item_desc3)
  .withColumn("column_id", concat(lit("item_desc"), $"id" % 4))
  .groupBy("key", "group_id")
  .pivot("column_id")
  .agg(first("item_desc"))
  .drop("group_id")
  .show
// +---+----------+----------+----------+----------+
// |key|item_desc0|item_desc1|item_desc2|item_desc3|
// +---+----------+----------+----------+----------+
// | 1| desc1| desc2| desc3| desc4|
// | 1| desc5| desc6| null| null|
// +---+----------+----------+----------+----------+
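If you want the columns named item_desc1 .. item_desc4 as in the question, shifting the index by one in the column_id step should be enough (a small, untested tweak):
  // number the pivot columns from 1 instead of 0
  .withColumn("column_id", concat(lit("item_desc"), $"id" % 4 + 1))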
But unless the number of values associated with a single key is small, this won't scale well, so use it at your own risk.
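Since the question also mentions RDDs, here is a minimal sketch of the same chunk-into-4 logic on the RDD API. It assumes a SparkSession named spark is in scope and carries the same caveat, since groupByKey also pulls all descriptions for a key onto a single executor:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Group descriptions per key, sort them, slice them into chunks of 4,
// and pad the last chunk with nulls.
val rows = df.rdd
  .map { case Row(key: Int, desc: String) => (key, desc) }
  .groupByKey()
  .flatMap { case (key, descs) =>
    descs.toSeq.sorted.grouped(4).map { g =>
      Row.fromSeq(key +: g.padTo(4, null: String))
    }
  }

// Target schema: key, item_desc1 .. item_desc4
val schema = StructType(
  StructField("key", IntegerType) +:
    (1 to 4).map(i => StructField(s"item_desc$i", StringType))
)

spark.createDataFrame(rows, schema).show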
Upvotes: 2