Reputation: 1086
I have a dataframe with the schema:
root
|-- col2: integer (nullable = true)
|-- col1: integer (nullable = true)
|-- structCol3: struct (nullable = true)
| |-- structField2: boolean (nullable = true)
| |-- structField1: string (nullable = true)
|-- structCol4: struct (nullable = true)
| |-- nestedArray: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- elem3: double (nullable = true)
| | | |-- elem2: string (nullable = true)
| | | |-- elem1: string (nullable = true)
| |-- structField2: integer (nullable = true)
Desired schema:
root
|-- col1: integer (nullable = true)
|-- col2: integer (nullable = true)
|-- structCol3: struct (nullable = true)
| |-- structField1: string (nullable = true)
| |-- structField2: boolean (nullable = true)
|-- structCol4: struct (nullable = true)
| |-- nestedArray: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- elem1: string (nullable = true)
| | | |-- elem2: string (nullable = true)
| | | |-- elem3: double (nullable = true)
| |-- structField2: integer (nullable = true)
So far I've had success rearranging the columns and the fields inside the structs like this:
dfParquetOutput = df.select(
    "col1",
    "col2",
    struct(
        col("structCol3.structField1"),
        col("structCol3.structField2")
    ).alias("structCol3"),
    struct(
        col("structCol4.nestedArray"),
        col("structCol4.structField2")
    ).alias("structCol4")
)
Unfortunately I'm struggling to find a way to rearrange the fields of the StructType that sits inside the array. I thought about using a udf, but I've had no success with it.
Is there a simple way to reorder the struct inside the array?
Upvotes: 3
Views: 1416
Reputation: 24488
Spark 2.4+
You can avoid a udf by using transform. Replace

col("structCol4.nestedArray")

with

expr("transform(structCol4.nestedArray, x -> struct(x.elem1, x.elem2, x.elem3)) as nestedArray")
A full test with your data follows.
Input:
from pyspark.sql import functions as F
df = spark.createDataFrame([], 'col2 int, col1 int, structCol3 struct<structField2:boolean,structField1:string>, structCol4 struct<nestedArray:array<struct<elem3:double,elem2:string,elem1:string>>, structField2:int>')
df.printSchema()
# root
# |-- col2: integer (nullable = true)
# |-- col1: integer (nullable = true)
# |-- structCol3: struct (nullable = true)
# | |-- structField2: boolean (nullable = true)
# | |-- structField1: string (nullable = true)
# |-- structCol4: struct (nullable = true)
# | |-- nestedArray: array (nullable = true)
# | | |-- element: struct (containsNull = true)
# | | | |-- elem3: double (nullable = true)
# | | | |-- elem2: string (nullable = true)
# | | | |-- elem1: string (nullable = true)
# | |-- structField2: integer (nullable = true)
Transformation:
df = df.select(
    "col1",
    "col2",
    F.struct(
        "structCol3.structField1",
        "structCol3.structField2"
    ).alias("structCol3"),
    F.struct(
        F.expr("transform(structCol4.nestedArray, x -> struct(x.elem1, x.elem2, x.elem3)) as nestedArray"),
        "structCol4.structField2"
    ).alias("structCol4")
)
df.printSchema()
# root
# |-- col1: integer (nullable = true)
# |-- col2: integer (nullable = true)
# |-- structCol3: struct (nullable = false)
# | |-- structField1: string (nullable = true)
# | |-- structField2: boolean (nullable = true)
# |-- structCol4: struct (nullable = false)
# | |-- nestedArray: array (nullable = true)
# | | |-- element: struct (containsNull = false)
# | | | |-- elem1: string (nullable = true)
# | | | |-- elem2: string (nullable = true)
# | | | |-- elem3: double (nullable = true)
# | |-- structField2: integer (nullable = true)
Upvotes: 0
Reputation: 35249
You cannot really avoid a udf (or the RDD API) here. If you define the data as
from pyspark.sql.functions import udf, struct, col
from collections import namedtuple
Outer = namedtuple("Outer", ["structCol4"])
Inner = namedtuple("Inner", ["nestedArray", "structField2"])
Element = namedtuple("Element", ["col3", "col2", "col1"])
df = spark.createDataFrame([Outer(Inner([Element("3", "2", "1")], 1))])
then you can reorder the nested array with a udf:
@udf("array<struct<col1: string, col2: string, col3: string>>")
def reorder(arr):
    return [(col1, col2, col3) for col3, col2, col1 in arr]

result = df.withColumn(
    "structCol4",
    struct(reorder("structCol4.nestedArray").alias("nestedArray"), col("structCol4.structField2")))
result.printSchema()
# root
# |-- structCol4: struct (nullable = false)
# | |-- nestedArray: array (nullable = true)
# | | |-- element: struct (containsNull = true)
# | | | |-- col1: string (nullable = true)
# | | | |-- col2: string (nullable = true)
# | | | |-- col3: string (nullable = true)
# | |-- structField2: long (nullable = true)
#
result.show()
# +----------------+
# | structCol4|
# +----------------+
# |[[[1, 2, 3]], 1]|
# +----------------+
With deeply nested schemas you'd have to restructure the complete tree inside the udf, but that is not required here.
Upvotes: 1