Reputation: 451
In my source data, I don't have any category column on which I can group data.
So, I want to add a new custom category column in Spark dataframe on the basis of another column say("QNAME") distinct value. Which I want to use in group By to Pivot.
Till now, I have no idea... how to get this.
Let's say I have below source dataframe
Qname b c d
SPT 1 10 555
MTK 2 20 556
NKP 3 30 557
LKM 4 40 558
SPT 5 50 559
MTK 7 70 561
QRS 6 60 560
NKP 7 70 561
SPT 5 50 559
LKM 7 70 561
QRS 7 70 561
MTK 7 70 561
NKP 7 70 561
So, on the basis of column("QNAME") values, I want to categories on distinct value combination. For example, ditinct values are (SPT,MTK,NKP,LKM,QRS)... So for first ocuurence of this value will maked as "aaa", then 2nd occurence will marked as "bbb" and so on.
So, below is the expected output.
Category Qname b c d
"aaa" SPT 1 10 555
"aaa" MTK 2 20 556
"aaa" NKP 3 30 557
"aaa" LKM 4 40 558
"bbb" SPT 5 50 559
"bbb" MTK 7 70 561
"aaa" QRS 6 60 560
"bbb" NKP 7 70 561
"ccc" SPT 5 50 559
"bbb" LKM 7 70 561
"bbb" QRS 7 70 561
"ccc" MTK 7 70 561
"ccc" NKP 7 70 561
I am using Scala and Spark to do this. Any help or suggestion is appreciated to crack this. Thanks in advance!
Upvotes: 1
Views: 1135
Reputation: 7207
Task can be completed with Window function "row_number()". If take into account Jonathan Myers comment, better use just numbers instead of characters:
val df = Seq(
("SPT", 1, 10, 555),
("MTK", 2, 20, 556),
("NKP", 3, 30, 557),
("LKM", 4, 40, 558),
("SPT", 5, 50, 559),
("MTK", 7, 70, 561),
("QRS", 6, 60, 560),
("NKP", 7, 70, 561),
("SPT", 5, 50, 559),
("LKM", 7, 70, 561),
("QRS", 7, 70, 561),
("MTK", 7, 70, 561),
("NKP", 7, 70, 561)
).toDF(
"Qname", "b", "c", "d"
)
// action
val categoryWindow = Window.partitionBy($"Qname").orderBy("c")
val result = df.withColumn("Category", row_number().over(categoryWindow))
Result:
+--------+-----+---+---+---+
|Category|Qname|b |c |d |
+--------+-----+---+---+---+
|1 |SPT |1 |10 |555|
|1 |NKP |3 |30 |557|
|1 |QRS |6 |60 |560|
|1 |LKM |4 |40 |558|
|1 |MTK |2 |20 |556|
|2 |NKP |7 |70 |561|
|2 |LKM |7 |70 |561|
|2 |QRS |7 |70 |561|
|2 |SPT |5 |50 |559|
|2 |MTK |7 |70 |561|
|3 |NKP |7 |70 |561|
|3 |MTK |7 |70 |561|
|3 |SPT |5 |50 |559|
+--------+-----+---+---+---+
Upvotes: 1