NoviceDeveloper

Reputation: 1

Create an array of structs with different columns (structure) in PySpark

I have two structs:

email_struct:
 |-- email_struct: struct (nullable = true)
 |    |-- UsageTypeDesc: string (nullable = true)
 |    |-- ContactInfo: struct (nullable = false)
 |    |    |-- ElectronicAddress: struct (nullable = false)
 |    |    |    |-- AddressSubtype: string (nullable = false)
 |    |    |    |-- SourceSystemTypeDesc: string (nullable = false)
 |    |    |    |-- ElectronicAddressTxt: string (nullable = true)


phone_struct:
 |-- phone_struct: struct (nullable = true)
 |    |-- UsageTypeDesc: string (nullable = true)
 |    |-- ContactInfo: struct (nullable = false)
 |    |    |-- TelephoneNumber: struct (nullable = false)
 |    |    |    |-- AddressSubtype: string (nullable = false)
 |    |    |    |-- SourceSystemTypeDesc: string (nullable = false)
 |    |    |    |-- TelephoneNum: string (nullable = true)

How can I create an array of structs in PySpark that looks like this?

"ContactInfo": [
    {
        "UsageTypeDesc": "PHONE",
        "ContactInfo": {
            "TelephoneNumber": {
                "AddressSubtype": "TELEPHONE NUMBER",
                "SourceSystemTypeDesc": "",
                "TelephoneNum": ""
            }
        }
    },
    {
        "UsageTypeDesc": "EMAIL",
        "ContactInfo": {
            "ElectronicAddress": {
                "AddressSubtype": "EMAIL ADDRESS",
                "SourceSystemTypeDesc": "",
                "ElectronicAddressTxt": ""
            }
        }
    }
]

Error : pyspark.errors.exceptions.captured.AnalysisException: [DATATYPE_MISMATCH.DATA_DIFF_TYPES] Cannot resolve "array(phone_struct, email_struct)" due to data type mismatch: Input to array should all be the same type, but it's ("STRUCT<UsageTypeDesc: STRING, ContactInfo: STRUCT<TelephoneNumber: STRUCT<AddressSubtype: STRING, SourceSystemTypeDesc: STRING, TelephoneNum: STRING>>>" or "STRUCT<UsageTypeDesc: STRING, ContactInfo: STRUCT<ElectronicAddress: STRUCT<AddressSubtype: STRING, SourceSystemTypeDesc: STRING, ElectronicAddressTxt: STRING>>>").;

I get the error above when I run the following:

from pyspark.sql.functions import array, col

df = df.withColumn(
    "ContactInfo",
    array(col("phone_struct"), col("email_struct")),
)

Upvotes: 0

Views: 85

Answers (1)

Derek O

Reputation: 19610

Every element of an ArrayType column must have the same data type, so you cannot build an array from two structs whose schemas differ. I am not sure about your use case, but you could instead combine the two fields in another struct:

from pyspark.sql import functions as F

df = df.withColumn("ContactInfo", F.struct(
    F.col("phone_struct"),
    F.col("email_struct"),
))

Upvotes: 0
