Reputation: 883
NOTE: This is just quick example data; it won't make sense when compared to an actual cricket team.
I have a JSON file as shown below:
{
  "someID": "a5cf4922f4e3f45",
  "payload": {
    "teamID": "1",
    "players": [
      {
        "type": "Batsman",
        "name": "Amar",
        "address": {
          "state": "Gujarat"
        }
      },
      {
        "type": "Bowler",
        "name": "Akbar",
        "address": {
          "state": "Telangana"
        }
      },
      {
        "type": "Fielder",
        "name": "Antony",
        "address": {
          "state": "Kerala"
        }
      }
    ]
  }
}
I have exploded this with the below code:
from pyspark.sql.functions import explode

df_record = spark.read.json("path-to-file.json", multiLine=True)
df_player_dtls = df_record.select("payload.teamID", explode("payload.players").alias("xplayers")) \
.select("teamID", \
"xplayers.type", \
"xplayers.name", \
"xplayers.address.state")
df_player_dtls.createOrReplaceTempView("t_player_dtls")
spark.sql("SELECT * FROM t_player_dtls").show()
So currently the output looks like this:
+--------+---------+--------+------------+
| TeamID | Type | Name | State |
+--------+---------+--------+------------+
| 1 | Batsman | Amar | Gujarat |
| 1 | Bowler | Akbar | Telangana |
| 1 | Fielder | Antony | Kerala |
| 2 | Batsman | John | Queensland |
| 2 | Bowler | Smith | Perth |
+--------+---------+--------+------------+
I want to convert it to the below shown format:
+--------+--------------+---------------+-------------+--------------+--------------+---------------+
| TeamID | Batsman.Name | Batsman.State | Bowler.Name | Bowler.State | Fielder.Name | Fielder.State |
+--------+--------------+---------------+-------------+--------------+--------------+---------------+
| 1 | Amar | Gujarat | Akbar | Telangana | Antony | Kerala |
| 2 | John | Queensland | Smith | Perth | null | null |
+--------+--------------+---------------+-------------+--------------+--------------+---------------+
There will only be one player of each type in a team, and there can be at most four types of players in each team (Batsman, Bowler, Fielder, and Wicketkeeper). So the maximum number of players in each team is four. Hence, the final table that will hold this data has nine columns (one for the team ID, plus Name and State for each of the four player types).
Is it possible to accomplish this in Spark? I am a rookie in Spark, so answers that explain the steps would be greatly appreciated.
Upvotes: 1
Views: 7417
Reputation: 3619
It is possible with SQL, which is not the most efficient way (a UDF would be), but it works. And sorry that it is Scala-ish.
val res = spark.sql(
"""select teamID
|, Batsman.name as `Batsman.name`, Batsman.state as `Batsman.state`
|, Bowler.name as `Bowler.name`, Bowler.state as `Bowler.state`
|, Fielder.name as `Fielder.name`, Fielder.state as `Fielder.state`
|from (
| select teamID,
| max(case type when 'Batsman' then info end) as Batsman
| , max(case type when 'Bowler' then info end) as Bowler
| , max(case type when 'Fielder' then info end) as Fielder
| from (select teamID, type, struct(name, state) as info from t_player_dtls) group by teamID
|)""".stripMargin)
I used GROUP BY to pivot the data around the teamID column; max will select a value that is not null, and the CASE statement allows only one record into max. To simplify the max/case combination, I used the struct function, which creates a composite column info made of the payload that we later lift into the flat schema.
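Since the question uses Python, here is a minimal PySpark sketch of the same query; it assumes the temp view t_player_dtls created in the question:

# Same pivot-by-max(CASE) query, run from PySpark against the temp view
res = spark.sql("""
    SELECT teamID
        , Batsman.name AS `Batsman.name`, Batsman.state AS `Batsman.state`
        , Bowler.name  AS `Bowler.name`,  Bowler.state  AS `Bowler.state`
        , Fielder.name AS `Fielder.name`, Fielder.state AS `Fielder.state`
    FROM (
        SELECT teamID
            , max(CASE type WHEN 'Batsman' THEN info END) AS Batsman
            , max(CASE type WHEN 'Bowler'  THEN info END) AS Bowler
            , max(CASE type WHEN 'Fielder' THEN info END) AS Fielder
        FROM (SELECT teamID, type, struct(name, state) AS info FROM t_player_dtls)
        GROUP BY teamID
    )
""")
res.show(truncate=False)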
A UDF would have been more efficient, but I am not familiar with Python.
UPDATE: Both solutions (SQL and pivot) use the explode and groupBy combo (@Anshuman's is much easier to code) and have the following execution plans:
SQL
== Physical Plan ==
SortAggregate(key=[teamID#10], functions=[max(CASE WHEN (type#16 = Batsman) THEN info#31 END), max(CASE WHEN (type#16 = Bowler) THEN info#31 END), max(CASE WHEN (type#16 = Fielder) THEN info#31 END)])
+- *Sort [teamID#10 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(teamID#10, 200)
+- SortAggregate(key=[teamID#10], functions=[partial_max(CASE WHEN (type#16 = Batsman) THEN info#31 END), partial_max(CASE WHEN (type#16 = Bowler) THEN info#31 END), partial_max(CASE WHEN (type#16 = Fielder) THEN info#31 END)])
+- *Sort [teamID#10 ASC NULLS FIRST], false, 0
+- *Project [payload#4.teamID AS teamID#10, xplayers#12.type AS type#16, named_struct(name, xplayers#12.name, state, xplayers#12.address.state) AS info#31]
+- Generate explode(payload#4.players), true, false, [xplayers#12]
+- *Project [payload#4]
+- Scan ExistingRDD[payload#4,someID#5]
PIVOT
== Physical Plan ==
SortAggregate(key=[TeamID#10], functions=[first(if ((Type#16 <=> Batsman)) Name#17 else null, true), first(if ((Type#16 <=> Batsman)) State#18 else null, true), first(if ((Type#16 <=> Bowler)) Name#17 else null, true), first(if ((Type#16 <=> Bowler)) State#18 else null, true), first(if ((Type#16 <=> Fielder)) Name#17 else null, true), first(if ((Type#16 <=> Fielder)) State#18 else null, true)])
+- *Sort [TeamID#10 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(TeamID#10, 200)
+- SortAggregate(key=[TeamID#10], functions=[partial_first(if ((Type#16 <=> Batsman)) Name#17 else null, true), partial_first(if ((Type#16 <=> Batsman)) State#18 else null, true), partial_first(if ((Type#16 <=> Bowler)) Name#17 else null, true), partial_first(if ((Type#16 <=> Bowler)) State#18 else null, true), partial_first(if ((Type#16 <=> Fielder)) Name#17 else null, true), partial_first(if ((Type#16 <=> Fielder)) State#18 else null, true)])
+- *Sort [TeamID#10 ASC NULLS FIRST], false, 0
+- *Project [payload#4.teamID AS teamID#10, xplayers#12.type AS type#16, xplayers#12.name AS name#17, xplayers#12.address.state AS state#18]
+- Generate explode(payload#4.players), true, false, [xplayers#12]
+- *Project [payload#4]
+- Scan ExistingRDD[payload#4,someID#5]
Both cause a shuffle (Exchange hashpartitioning(TeamID#10, 200)).
If performance is your goal, then you could use this Scala approach (I do not know Python):
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

//row_1 and row_2 are JSON strings holding the input records
val df_record = spark.read.json(Seq(row_1, row_2).toDS)

//Define your custom player types, as many as needed
val playerTypes = Seq("Batsman", "Bowler", "Fielder")

//Return type for the UDF
val returnType = StructType(playerTypes.flatMap(t => Seq(StructField(s"$t.Name", StringType), StructField(s"$t.State", StringType))))

val unpackPlayersUDF = udf( (players: Seq[Row]) => {
    val playerValues: Map[String, Row] = players.map(p => (p.getAs[String]("type"), p)).toMap
    val arrangedValues = playerTypes.flatMap { t =>
      val playerRow = playerValues.get(t) //if type does not exist, then value will be None, which is null
      Seq(
        playerRow.map(_.getAs[String]("name"))
        , playerRow.map(_.getAs[Row]("address").getAs[String]("state"))
      )
    }
    Row(arrangedValues: _*)
  }
  , returnType)

val udfRes = df_record
  .withColumn("xplayers", unpackPlayersUDF($"payload.players"))
  .select("payload.teamID", "xplayers.*")

udfRes.show(false)
udfRes.explain()
Output:
+------+------------+-------------+-----------+------------+------------+-------------+
|teamID|Batsman.Name|Batsman.State|Bowler.Name|Bowler.State|Fielder.Name|Fielder.State|
+------+------------+-------------+-----------+------------+------------+-------------+
|1 |Amar |Gujarat |Akbar |Telangana |Antony |Kerala |
|1 |John |Queensland |Smith |Perth |null |null |
+------+------------+-------------+-----------+------------+------------+-------------+
With the following execution plan:
== Physical Plan ==
*Project [payload#4.teamID AS teamID#46, UDF(payload#4.players).Batsman.Name AS Batsman.Name#40, UDF(payload#4.players).Batsman.State AS Batsman.State#41, UDF(payload#4.players).Bowler.Name AS Bowler.Name#42, UDF(payload#4.players).Bowler.State AS Bowler.State#43, UDF(payload#4.players).Fielder.Name AS Fielder.Name#44, UDF(payload#4.players).Fielder.State AS Fielder.State#45]
+- Scan ExistingRDD[payload#4,someID#5]
No shuffle is involved. If you want to increase the performance even further, adding an explicit read schema via spark.read.schema(SCHEMA).json(...) will help, as the reader will not have to infer the schema, which saves time.
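For completeness, a minimal PySpark sketch of such an explicit schema for the JSON in the question (field names follow the sample file; adjust as needed):

from pyspark.sql.types import StructType, StructField, StringType, ArrayType

# Schema matching the sample JSON, so Spark can skip schema inference
schema = StructType([
    StructField("someID", StringType()),
    StructField("payload", StructType([
        StructField("teamID", StringType()),
        StructField("players", ArrayType(StructType([
            StructField("type", StringType()),
            StructField("name", StringType()),
            StructField("address", StructType([
                StructField("state", StringType())
            ]))
        ])))
    ]))
])

df_record = spark.read.schema(schema).json("path-to-file.json", multiLine=True)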
Upvotes: 1
Reputation: 420
We can use the pivot function of PySpark:
from pyspark.sql.functions import first
df = df_player_dtls.groupBy("TeamID").pivot("Type").agg(
    first("Name").alias("Name"),
    first("State").alias("State"))
df.show(10, False)
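If the player types are known up front (as they are here), you can also pass them to pivot explicitly, which avoids the extra pass Spark otherwise needs to collect the distinct pivot values and fixes the column order; a small sketch, assuming the four types from the question:

from pyspark.sql.functions import first

player_types = ["Batsman", "Bowler", "Fielder", "Wicketkeeper"]

# Passing the values to pivot() skips the job that computes distinct types
df = df_player_dtls.groupBy("TeamID").pivot("Type", player_types).agg(
    first("Name").alias("Name"),
    first("State").alias("State"))
df.show(10, False)

Note that with multiple aggregations the resulting columns come out named like Batsman_Name and Batsman_State, so rename them afterwards if you need the dotted names from the question.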
Upvotes: 3