Nabih Bawazir

Reputation: 7255

How to flatten array of struct?

How can I change the schema of a PySpark DataFrame from this

|-- id: string (nullable = true)
|-- device: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- device_vendor: string (nullable = true)
|    |    |-- device_name: string (nullable = true)
|    |    |-- device_manufacturer: string (nullable = true)

to this

|-- id: string (nullable = true)
|-- device_vendor: string (nullable = true)
|-- device_name: string (nullable = true)
|-- device_manufacturer: string (nullable = true)

Upvotes: 1

Views: 1902

Answers (2)

Hristo Iliev

Reputation: 74375

Use a combination of explode and the * selector:

import pyspark.sql.functions as F

# Parentheses let the method chain span several lines
df_flat = (
    df.withColumn('device_exploded', F.explode('device'))
      .select('id', 'device_exploded.*')
)

df_flat.printSchema()
# root
#  |-- id: string (nullable = true)
#  |-- device_vendor: string (nullable = true)
#  |-- device_name: string (nullable = true)
#  |-- device_manufacturer: string (nullable = true)

explode creates a separate record for each element of the array-valued column, repeating the value(s) of the other column(s). The column.* selector turns all fields of the struct-valued column into separate columns.
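To make the row-level effect concrete, here is a minimal plain-Python model of what explode followed by the .* selector does. It is only a sketch of the semantics, not Spark code: the helper name explode_and_flatten and the sample values are made up for illustration. It also models one behaviour worth knowing: explode produces no output rows for a null or empty array (explode_outer is the variant that keeps such rows, with nulls).

```python
# Plain-Python model of explode + "struct.*" (illustration only, not the Spark API).
# Sample data is hypothetical; field names follow the schema in the question.
rows = [
    {"id": "a", "device": [
        {"device_vendor": "v1", "device_name": "n1", "device_manufacturer": "m1"},
        {"device_vendor": "v2", "device_name": "n2", "device_manufacturer": "m2"},
    ]},
    {"id": "b", "device": []},    # empty array
    {"id": "c", "device": None},  # null array
]


def explode_and_flatten(rows, array_col):
    """Return one output row per array element, with struct fields as top-level keys."""
    out = []
    for row in rows:
        # Values of the other columns are repeated for every element.
        others = {k: v for k, v in row.items() if k != array_col}
        # A null or empty array contributes no rows, as with explode.
        for element in row[array_col] or []:
            out.append({**others, **element})
    return out


flat = explode_and_flatten(rows, "device")
for r in flat:
    print(r)
```

With this sample input, only the two devices of id "a" appear in the result; ids "b" and "c" disappear because their arrays are empty or null.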

Upvotes: 1

ZygD

Reputation: 24366

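First, take the first element of the array using element_at (indices are 1-based), then extract all fields from the struct using *. Note that this keeps only the first device of each array and discards the rest.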

import pyspark.sql.functions as F

df = df.withColumn('d', F.element_at('device', 1))  # index 1 = first element
df = df.select('id', 'd.*')
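The difference from the explode approach is easiest to see at the row level. Below is a rough plain-Python model of element_at followed by .*, again with a hypothetical helper name and made-up sample values. It assumes ANSI mode is off, where element_at on an empty array yields null; with ANSI mode enabled, an out-of-range index raises an error instead.

```python
# Plain-Python model of element_at(col, 1) + "struct.*" (illustration only).
# Assumes non-ANSI behaviour: a missing first element becomes null, not an error.
FIELDS = ("device_vendor", "device_name", "device_manufacturer")

rows = [
    {"id": "a", "device": [
        {"device_vendor": "v1", "device_name": "n1", "device_manufacturer": "m1"},
        {"device_vendor": "v2", "device_name": "n2", "device_manufacturer": "m2"},
    ]},
    {"id": "b", "device": []},    # empty array
    {"id": "c", "device": None},  # null array
]


def first_element_and_flatten(rows, array_col):
    """Return exactly one output row per input row, built from the first array element."""
    out = []
    for row in rows:
        others = {k: v for k, v in row.items() if k != array_col}
        array = row[array_col]
        # element_at is 1-based, so index 1 corresponds to Python index 0.
        first = array[0] if array else None
        # Expanding a null struct yields null for every field.
        fields = first if first is not None else dict.fromkeys(FIELDS)
        out.append({**others, **fields})
    return out


result = first_element_and_flatten(rows, "device")
for r in result:
    print(r)
```

Here every input row survives, the second device of id "a" is dropped, and ids "b" and "c" come out with null fields. So this approach fits when each array holds at most one device of interest, while explode fits when every device should become its own row.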

Upvotes: 1
