AMAN DUTTA

Reputation: 61

Spark Scala - I need nth row of the group

I exploded a column and got the dataframe as below:

+------------+-----------+--------------------+
|serialnumber|   roomname|             devices|
+------------+-----------+--------------------+
|hello       |Living Room|             device1|
|hello       |Living Room|             device2|
|hello       |Living Room|             device3|
|hello       |Living Room|             device4|
|hello       |Living Room|             device5|
|hello       |    Kitchen|             device1|
|hello       |    Kitchen|             device2|
|hello       |    Kitchen|             device3|
|hello       |    Kitchen|             device4|
|hello       |    Kitchen|             device5|
|hello       |   Bedroom1|             device1|
|hello       |   Bedroom1|             device2|
|hello       |   Bedroom1|             device3|
|hello       |   Bedroom1|             device4|
|hello       |   Bedroom1|             device5|
|hello       |  Bedroom 2|             device1|
|hello       |  Bedroom 2|             device2|
|hello       |  Bedroom 2|             device3|
|hello       |  Bedroom 2|             device4|
|hello       |  Bedroom 2|             device5|
|hello       |   Bedroom3|             device1|
|hello       |   Bedroom3|             device2|
|hello       |   Bedroom3|             device3|
|hello       |   Bedroom3|             device4|
|hello       |   Bedroom3|             device5|
+------------+-----------+--------------------+

Now I want a dataframe as below, i.e. the 1st device of Living Room, the 2nd of Kitchen, the 3rd of Bedroom1, and so on:

+------------+-----------+--------------------+
|serialnumber|   roomname|             devices|
+------------+-----------+--------------------+
|hello       |Living Room|             device1|
|hello       |    Kitchen|             device2|
|hello       |   Bedroom1|             device3|
|hello       |  Bedroom 2|             device4|
|hello       |   Bedroom3|             device5|
+------------+-----------+--------------------+

Upvotes: 0

Views: 541

Answers (2)

From what I understand, the problem you are having is that you lose the position of each element when you use explode.

Assuming that roomname is of type Array[..], you can use posexplode instead of explode:

import org.apache.spark.sql.functions.posexplode
import spark.implicits._

val df = Seq(
  ("hello", List[String]("room1", "room2")),
  ("hello1", List[String]("room1", "room2"))
).toDF("serial", "roomname")

df.select(posexplode($"roomname")).show()

will give you the following output:

+---+-----+
|pos|  col|
+---+-----+
|  0|room1|
|  1|room2|
|  0|room1|
|  1|room2|
+---+-----+

You can then select the n-th row as required by adding a filter on the pos column.
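That last step can be sketched with plain Scala collections (assumed stand-in data mirroring the df above; zipWithIndex plays the role of posexplode's pos index, so this is an illustration of the logic rather than Spark code):

```scala
// Assumed in-memory stand-in for the df above.
val data = Seq(
  ("hello",  List("room1", "room2")),
  ("hello1", List("room1", "room2"))
)

// What posexplode produces: one row per array element plus its position.
val exploded = data.flatMap { case (serial, rooms) =>
  rooms.zipWithIndex.map { case (room, pos) => (serial, pos, room) }
}

// The filter step: keep only the n-th element of each array (pos is 0-based).
val n = 1
val nth = exploded.filter { case (_, pos, _) => pos == n }
// nth: Seq(("hello", 1, "room2"), ("hello1", 1, "room2"))
```

In Spark the equivalent filter would compare the pos column against the desired index.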

Upvotes: 0

koiralo

Reputation: 23109

Here is how you can do it with groupBy and a window function, but you need to know the ordering of the roomname column.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val window = Window.partitionBy("serialnumber").orderBy("roomname")

df.groupBy("serialnumber", "roomname")
  .agg(collect_list("devices").as("devices"))
  .withColumn("index", rank().over(window))
  .withColumn("devices", element_at($"devices", $"index"))
  .drop("index")
  .show(false)
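For intuition, the same pipeline can be sketched with plain Scala collections on an assumed in-memory copy of the question's data. groupBy, sortBy, and zipWithIndex stand in for Spark's groupBy, the window's orderBy, and rank(); indexing with the 0-based position matches element_at with the 1-based rank:

```scala
// Assumed in-memory copy of the question's exploded data.
val rows = for {
  room   <- Seq("Living Room", "Kitchen", "Bedroom1", "Bedroom 2", "Bedroom3")
  device <- Seq("device1", "device2", "device3", "device4", "device5")
} yield ("hello", room, device)

val result = rows
  .groupBy { case (serial, room, _) => (serial, room) }                    // groupBy(serialnumber, roomname)
  .map { case ((serial, room), grp) => (serial, room, grp.map(_._3)) }     // collect_list(devices)
  .toSeq
  .sortBy { case (_, room, _) => room }                                    // window orderBy("roomname")
  .zipWithIndex                                                            // 0-based; rank() is 1-based
  .map { case ((serial, room, devices), i) => (serial, room, devices(i)) } // element_at(devices, rank)

result.foreach(println)
```

Note that rooms sort lexicographically ("Bedroom 2" before "Bedroom1", since a space sorts before a digit), which is why the output above starts with Bedroom 2.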

Output:

+------------+-----------+-------+
|serialnumber|roomname   |devices|
+------------+-----------+-------+
|hello       |Bedroom 2  |device1|
|hello       |Bedroom1   |device2|
|hello       |Bedroom3   |device3|
|hello       |Kitchen    |device4|
|hello       |Living Room|device5|
+------------+-----------+-------+

Upvotes: 1
