Reputation: 189
I have a PySpark dataframe of students with schema as follows:
|-- Id: string
|-- School: array
|    |-- element: struct
|    |    |-- Subject: string
|    |    |-- Classes: string
|    |    |-- Score: array
|    |    |    |-- element: struct
|    |    |    |    |-- ScoreID: string
|    |    |    |    |-- Value: string
I want to extract a few fields from the dataframe and normalize them so that I can load them into a database. The relational schema I expect consists of the fields Id, School, Subject, ScoreID, Value. How can I do this efficiently?
Upvotes: 1
Views: 445
Reputation: 31540
explode
the arrays to flatten the nested data, then select all the required columns.
Example:
df.show(10,False)
#+---+--------------------------+
#|Id |School |
#+---+--------------------------+
#|1 |[[b, [[A, 3], [B, 4]], a]]|
#+---+--------------------------+
df.printSchema()
#root
# |-- Id: string (nullable = true)
# |-- School: array (nullable = true)
# | |-- element: struct (containsNull = true)
# | | |-- Classes: string (nullable = true)
# | | |-- Score: array (nullable = true)
# | | | |-- element: struct (containsNull = true)
# | | | | |-- ScoreID: string (nullable = true)
# | | | | |-- Value: string (nullable = true)
# | | |-- Subject: string (nullable = true)
df.selectExpr("Id","explode(School)").\
selectExpr("Id","col.*","explode(col.Score)").\
selectExpr("Id","Classes","Subject","col.*").\
show()
#+---+-------+-------+-------+-----+
#| Id|Classes|Subject|ScoreID|Value|
#+---+-------+-------+-------+-----+
#| 1| b| a| A| 3|
#| 1| b| a| B| 4|
#+---+-------+-------+-------+-----+
Upvotes: 2