yahoo

Reputation: 331

Convert array of arrays to array of structs in PySpark

I have a DataFrame like the one below:

id  contact_persons
-----------------------
1   [[abc, [email protected], 896676, manager],[pqr, [email protected], 89809043, director],[stu, [email protected], 09909343, programmer]]    

The schema looks like this:

root
 |-- id: string (nullable = true)
 |-- contact_persons: array (nullable = true)
 |    |-- element: array (containsNull = true)
 |    |    |-- element: string (containsNull = true)

I need to convert this DataFrame to the schema below:

 root
 |-- id: string (nullable = true)
 |-- contact_persons: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- emails: string (nullable = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- phone: string (nullable = true)
 |    |    |-- roles: string (nullable = true)

I know there is a struct function in PySpark, but I don't know how to use it in this scenario because the array is dynamically sized.

Upvotes: 1

Views: 501

Answers (1)

Kafels

Reputation: 4069

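You can use a TRANSFORM expression to turn each inner array into a struct. It is applied to every element, so the size of the outer array does not matter: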

import pyspark.sql.functions as f

df = spark.createDataFrame([
  ['1', [['abc', '[email protected]', '896676', 'manager'],
       ['pqr', '[email protected]', '89809043', 'director'],
       ['stu', '[email protected]', '09909343', 'programmer']]]
], schema='id string, contact_persons array<array<string>>')

expression = 'TRANSFORM(contact_persons, el -> STRUCT(el[0] AS name, el[1] AS emails, el[2] AS phone, el[3] AS roles))'
output_df = df.withColumn('contact_persons', f.expr(expression))

# output_df.printSchema()
# root
#  |-- id: string (nullable = true)
#  |-- contact_persons: array (nullable = true)
#  |    |-- element: struct (containsNull = false)
#  |    |    |-- name: string (nullable = true)
#  |    |    |-- emails: string (nullable = true)
#  |    |    |-- phone: string (nullable = true)
#  |    |    |-- roles: string (nullable = true)

output_df.show(truncate=False)
# +---+-----------------------------------------------------------------------------------------------------------------------+
# |id |contact_persons                                                                                                        |
# +---+-----------------------------------------------------------------------------------------------------------------------+
# |1  |[{abc, [email protected], 896676, manager}, {pqr, [email protected], 89809043, director}, {stu, [email protected], 09909343, programmer}]|
# +---+-----------------------------------------------------------------------------------------------------------------------+
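
To see what the TRANSFORM lambda does to a single row, here is a rough plain-Python model of it (no Spark involved). FIELDS and to_structs are names made up for this sketch, and the example.com addresses are placeholders, not the values from the question. It only illustrates the positional mapping; the real work is still done by the SQL expression above.

```python
# Field names in positional order, matching the STRUCT(...) aliases above.
# FIELDS and to_structs are hypothetical names used only for this sketch.
FIELDS = ("name", "emails", "phone", "roles")


def to_structs(contact_persons):
    """Plain-Python model of TRANSFORM(contact_persons, el -> STRUCT(...))."""
    # One output record per inner list, however many inner lists there are.
    return [dict(zip(FIELDS, el)) for el in contact_persons]


# Placeholder data: the addresses are invented for illustration.
row = [
    ["abc", "abc@example.com", "896676", "manager"],
    ["pqr", "pqr@example.com", "89809043", "director"],
]
structs = to_structs(row)
print(structs[0]["name"], len(structs))
```

Because the mapping runs once per inner list, the length of the outer list never has to be known in advance; only the four positions inside each inner list are fixed.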

Upvotes: 2
