Reputation: 189
I have a PySpark dataframe of students with schema as follows:
|-- Id: string
|-- School: array
|    |-- element: struct
|    |    |-- Subject: string
|    |    |-- Classes: string
|    |    |-- Score: array
|    |    |    |-- element: struct
|    |    |    |    |-- ScoreID: string
|    |    |    |    |-- Value: string
I want to extract a few fields from the dataframe and normalize them so that I can load them into a database. The relational schema I expect consists of the fields Id, School, Subject, ScoreID, Value. How can I do this efficiently?
Upvotes: 1
Views: 445
Reputation: 31540
explode
the arrays to flatten the nested data, then select all the required columns.
Example:
df.show(10,False)
#+---+--------------------------+
#|Id |School |
#+---+--------------------------+
#|1 |[[b, [[A, 3], [B, 4]], a]]|
#+---+--------------------------+
df.printSchema()
#root
# |-- Id: string (nullable = true)
# |-- School: array (nullable = true)
# | |-- element: struct (containsNull = true)
# | | |-- Classes: string (nullable = true)
# | | |-- Score: array (nullable = true)
# | | | |-- element: struct (containsNull = true)
# | | | | |-- ScoreID: string (nullable = true)
# | | | | |-- Value: string (nullable = true)
# | | |-- Subject: string (nullable = true)
df.selectExpr("Id","explode(School)").\
selectExpr("Id","col.*","explode(col.Score)").\
selectExpr("Id","Classes","Subject","col.*").\
show()
#+---+-------+-------+-------+-----+
#| Id|Classes|Subject|ScoreID|Value|
#+---+-------+-------+-------+-----+
#| 1| b| a| A| 3|
#| 1| b| a| B| 4|
#+---+-------+-------+-------+-----+
Upvotes: 2