Reputation: 61
I exploded a column and got the following DataFrame:
+------------+-----------+--------------------+
|serialnumber| roomname| devices|
+------------+-----------+--------------------+
|hello |Living Room| device1|
|hello |Living Room| device2|
|hello |Living Room| device3|
|hello |Living Room| device4|
|hello |Living Room| device5|
|hello | Kitchen| device1|
|hello | Kitchen| device2|
|hello | Kitchen| device3|
|hello | Kitchen| device4|
|hello | Kitchen| device5|
|hello | Bedroom1| device1|
|hello | Bedroom1| device2|
|hello | Bedroom1| device3|
|hello | Bedroom1| device4|
|hello | Bedroom1| device5|
|hello | Bedroom 2| device1|
|hello | Bedroom 2| device2|
|hello | Bedroom 2| device3|
|hello | Bedroom 2| device4|
|hello | Bedroom 2| device5|
|hello | Bedroom3| device1|
|hello | Bedroom3| device2|
|hello | Bedroom3| device3|
|hello | Bedroom3| device4|
|hello | Bedroom3| device5|
+------------+-----------+--------------------+
Now I want a DataFrame like the one below, that is, the 1st device of Living Room, the 2nd of Kitchen, the 3rd of Bedroom1, and so on:
+------------+-----------+--------------------+
|serialnumber| roomname| devices|
+------------+-----------+--------------------+
|hello |Living Room| device1|
|hello | Kitchen| device2|
|hello | Bedroom1| device3|
|hello | Bedroom 2| device4|
|hello |   Bedroom3|             device5|
+------------+-----------+--------------------+
Upvotes: 0
Views: 541
Reputation: 717
From what I understand, the problem you are having is that you lose the order of roomname when you use explode.
Assuming that roomname is of type Array[..], you can use posexplode instead of explode:
import org.apache.spark.sql.functions.posexplode
import spark.implicits._  // already in scope in spark-shell

val df = Seq(
  ("hello", List("room1", "room2")),
  ("hello1", List("room1", "room2"))
).toDF("serial", "roomname")

df.select(posexplode($"roomname")).show()
will give you the following output
+---+-----+
|pos| col|
+---+-----+
| 0|room1|
| 1|room2|
| 0|room1|
| 1|room2|
+---+-----+
You can then select the "n-th" row as required by adding a filter on the pos column.
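To make the pos-based filter concrete without needing a Spark session, here is a plain-Scala sketch of the same logic, using the toy serial/room data from the snippet above (the data and the choice of n = 1 are illustrative assumptions):

```scala
// Plain-Scala sketch (no Spark): posexplode yields one (serial, pos, value)
// row per array element; filtering on pos keeps the n-th element per array.
val rows = Seq(
  ("hello",  List("room1", "room2")),
  ("hello1", List("room1", "room2"))
)

// Equivalent of posexplode($"roomname"): attach each element's 0-based position.
val exploded = rows.flatMap { case (serial, rooms) =>
  rooms.zipWithIndex.map { case (room, pos) => (serial, pos, room) }
}

// Equivalent of .filter($"pos" === 1): keep only the 2nd room per serial.
val second = exploded.filter { case (_, pos, _) => pos == 1 }

println(second)
```

The same filter in DataFrame terms would compare the pos column against the desired index, e.g. `$"pos" === 1`.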
Upvotes: 0
Reputation: 23109
Here is how you can do it with groupBy and a window function, but you need to know the ordering of the roomname column.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val window = Window.partitionBy("serialnumber").orderBy("roomname")

df.groupBy("serialnumber", "roomname")
  .agg(collect_list("devices").as("devices"))
  .withColumn("index", rank().over(window))
  .withColumn("devices", element_at($"devices", $"index"))
  .drop("index")
  .show(false)
Output:
+------------+-----------+-------+
|serialnumber|roomname |devices|
+------------+-----------+-------+
|hello |Bedroom 2 |device1|
|hello |Bedroom1 |device2|
|hello |Bedroom3 |device3|
|hello |Kitchen |device4|
|hello |Living Room|device5|
+------------+-----------+-------+
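Outside Spark, the rank-plus-element_at pairing above amounts to sorting the room names, numbering them from 1, and taking that number's device from each room's collected list. A plain-Scala sketch with the question's room and device names (taken from the question's table):

```scala
// Plain-Scala sketch of rank().over(window) + element_at(devices, index).
val devices = List("device1", "device2", "device3", "device4", "device5")
val rooms   = List("Living Room", "Kitchen", "Bedroom1", "Bedroom 2", "Bedroom3")

// groupBy + collect_list: every room ends up holding the full device list.
val grouped = rooms.map(room => room -> devices)

// rank() over orderBy("roomname"): 1-based index in alphabetical room order
// (note "Bedroom 2" sorts before "Bedroom1" because space < '1').
val ranked = grouped.sortBy(_._1).zipWithIndex.map {
  case ((room, devs), i) => (room, i + 1, devs)
}

// element_at(devices, index): both element_at and rank are 1-based.
val picked = ranked.map { case (room, idx, devs) => (room, devs(idx - 1)) }

picked.foreach(println)
```

This reproduces the output table above, and also shows why the result depends on the alphabetical order of roomname rather than the original room order.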
Upvotes: 1