Alan Featherston

Reputation: 1086

Rearranging StructType and nested array of structs

I have a dataframe with the schema:

root
 |-- col2: integer (nullable = true)
 |-- col1: integer (nullable = true)
 |-- structCol3: struct (nullable = true)
 |    |-- structField2: boolean (nullable = true)
 |    |-- structField1: string (nullable = true)
 |-- structCol4: struct (nullable = true)
 |    |-- nestedArray: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- elem3: double (nullable = true)
 |    |    |    |-- elem2: string (nullable = true)
 |    |    |    |-- elem1: string (nullable = true)
 |    |-- structField2: integer (nullable = true)

Desired schema:

root
 |-- col1: integer (nullable = true) 
 |-- col2: integer (nullable = true)
 |-- structCol3: struct (nullable = true)
 |    |-- structField1: string (nullable = true)
 |    |-- structField2: boolean (nullable = true)
 |-- structCol4: struct (nullable = true)
 |    |-- nestedArray: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- elem1: string (nullable = true)
 |    |    |    |-- elem2: string (nullable = true)
 |    |    |    |-- elem3: double (nullable = true)
 |    |-- structField2: integer (nullable = true)

So far I've had success rearranging the columns and the fields inside the structs like this:

from pyspark.sql.functions import col, struct

dfParquetOutput = df.select(
    "col1",
    "col2",
    struct(
        col("structCol3.structField1"), 
        col("structCol3.structField2")
    ).alias("structCol3"),
    struct(
        col("structCol4.nestedArray"),
        col("structCol4.structField2")
    ).alias("structCol4")
)

Unfortunately, I'm struggling to find a way to rearrange the fields of the StructType that sits inside the array. I thought about using a udf, but I've had no success with it.

Is there a simple way to reorder the struct inside the array?

Upvotes: 3

Views: 1416

Answers (2)

ZygD

Reputation: 24488

Spark 2.4+

You can avoid a udf by using the higher-order function transform. Replace col("structCol4.nestedArray") with

expr("transform(structCol4.nestedArray, x -> struct(x.elem1, x.elem2, x.elem3)) as nestedArray")

Full test with your data.

Input:

from pyspark.sql import functions as F
df = spark.createDataFrame([], 'col2 int, col1 int, structCol3 struct<structField2:boolean,structField1:string>, structCol4 struct<nestedArray:array<struct<elem3:double,elem2:string,elem1:string>>, structField2:int>')
df.printSchema()
# root
#  |-- col2: integer (nullable = true)
#  |-- col1: integer (nullable = true)
#  |-- structCol3: struct (nullable = true)
#  |    |-- structField2: boolean (nullable = true)
#  |    |-- structField1: string (nullable = true)
#  |-- structCol4: struct (nullable = true)
#  |    |-- nestedArray: array (nullable = true)
#  |    |    |-- element: struct (containsNull = true)
#  |    |    |    |-- elem3: double (nullable = true)
#  |    |    |    |-- elem2: string (nullable = true)
#  |    |    |    |-- elem1: string (nullable = true)
#  |    |-- structField2: integer (nullable = true)

Transformation:

df = df.select(
    "col1",
    "col2",
    F.struct(
        "structCol3.structField1", 
        "structCol3.structField2"
    ).alias("structCol3"),
    F.struct(
        F.expr("transform(structCol4.nestedArray, x -> struct(x.elem1, x.elem2, x.elem3)) as nestedArray"),
        "structCol4.structField2"
    ).alias("structCol4")
)
df.printSchema()
# root
#  |-- col1: integer (nullable = true)
#  |-- col2: integer (nullable = true)
#  |-- structCol3: struct (nullable = false)
#  |    |-- structField1: string (nullable = true)
#  |    |-- structField2: boolean (nullable = true)
#  |-- structCol4: struct (nullable = false)
#  |    |-- nestedArray: array (nullable = true)
#  |    |    |-- element: struct (containsNull = false)
#  |    |    |    |-- elem1: string (nullable = true)
#  |    |    |    |-- elem2: string (nullable = true)
#  |    |    |    |-- elem3: double (nullable = true)
#  |    |-- structField2: integer (nullable = true)
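
On Spark 3.1+ the same rearrangement can also be written with the Python-side F.transform instead of an expr string. A minimal sketch, assuming the column names from the question:

df = df.withColumn(
    "structCol4",
    F.struct(
        F.transform(
            F.col("structCol4.nestedArray"),
            # x is bound to each array element; the aliases pin the
            # field names in the new order
            lambda x: F.struct(
                x["elem1"].alias("elem1"),
                x["elem2"].alias("elem2"),
                x["elem3"].alias("elem3"),
            ),
        ).alias("nestedArray"),
        F.col("structCol4.structField2"),
    ),
)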

Upvotes: 0

Alper t. Turker

Reputation: 35249

You cannot really avoid a udf (or the RDD API) here. If you define the data as

from pyspark.sql.functions import udf, struct, col
from collections import namedtuple

Outer = namedtuple("Outer", ["structCol4"])
Inner = namedtuple("Inner", ["nestedArray", "structField2"])
Element = namedtuple("Element", ["col3", "col2", "col1"])

df = spark.createDataFrame([Outer(Inner([Element("3", "2", "1")], 1))])

You can

@udf("array<struct<col1: string, col2: string, col3: string>>")
def reorder(arr):
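    # Each element arrives as a Row in the input order (col3, col2, col1);
    # unpacking it and emitting a plain tuple yields the target order.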
    return [(col1, col2, col3) for col3, col2, col1 in arr]

result = df.withColumn(
    "structCol4",
    struct(
        reorder("structCol4.nestedArray").alias("nestedArray"),
        col("structCol4.structField2"),
    ),
)

result.printSchema()
# root
#  |-- structCol4: struct (nullable = false)
#  |    |-- nestedArray: array (nullable = true)
#  |    |    |-- element: struct (containsNull = true)
#  |    |    |    |-- col1: string (nullable = true)
#  |    |    |    |-- col2: string (nullable = true)
#  |    |    |    |-- col3: string (nullable = true)
#  |    |-- structField2: long (nullable = true)


result.show()
# +----------------+
# |      structCol4|
# +----------------+
# |[[[1, 2, 3]], 1]|
# +----------------+

With deeply nested schemas you'll have to restructure the complete tree inside the udf, but that is not required here.
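
For illustration, if each element also carried a nested struct to reorder (hypothetical fields, not from the schema above), every level has to be rebuilt inside the udf. A sketch:

from pyspark.sql.functions import udf

# Hypothetical input element: struct<col2: string, col1: string,
# inner: struct<b: string, a: string>>, to be reordered at both levels.
@udf("array<struct<col1: string, col2: string, inner: struct<a: string, b: string>>>")
def reorder_deep(arr):
    if arr is None:
        return None
    # Nested Rows support attribute access; emit plain tuples in the
    # target field order at every level of the tree.
    return [(e.col1, e.col2, (e.inner.a, e.inner.b)) for e in arr]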

Upvotes: 1
