bda

Reputation: 422

Convert array to struct in dataframe

In my dataframe, I need to convert array-type columns to structs. I can do this manually on a sample of the data (by editing it in an editor), and the result is exactly the data that I need. I need to do the same in PySpark.

Input dataframe schema:

root
 |-- id: string (nullable = true)
 |-- description: string (nullable = true)
 |-- documents: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- doc_name: string (nullable = true)
 |    |    |-- obligations: struct (nullable = true)
 |-- contacts: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- contact_first_name: string (nullable = true)
 |    |    |-- contact_last_name: string (nullable = true)

Data:

{
   "id":"123",
   "description": "agreement",
   "documents":[
     {
       "id":"doc_id_1",
       "doc_name":"doc_name_1",
       "obligations":{}
     }
   ],
   "contacts":[
    {
      "id":"contact_id_1",
      "contact_first_name":"John",
      "contact_last_name":"Doe"
    }
  ]
}

Schema that I need:

root
 |-- id: string (nullable = true)
 |-- description: string (nullable = true)
 |-- documents: struct (containsNull = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- doc_name: string (nullable = true)
 |    |    |-- obligations: struct (containsNull = true)
 |-- contacts: struct (containsNull = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- contact_first_name: string (nullable = true)
 |    |    |-- contact_last_name: string (nullable = true)

Data that I need:

{
   "id":"123",
   "description": "agreement",
   "documents":{
     {
       "id":"doc_id_1",
       "doc_name":"doc_name_1",
       "obligations":{}
     }
   },
   "contacts":{
    {
      "id":"contact_id_1",
      "contact_first_name":"John",
      "contact_last_name":"Doe"
    }
  }
}

Upvotes: 1

Views: 318

Answers (1)

ZygD

Reputation: 24356

Arrays differ from structs in that an array can hold many items. In your current setup you have an array of structs, so that array may hold many structs. Only if you are sure that each array holds exactly one struct can you safely extract the first element and move it one level up (removing the array), like this:

df = df.withColumn('contacts', F.col('contacts')[0])
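
To see what taking element 0 means for a single record, here is the same idea on the sample JSON in plain Python. This is only an illustration, not Spark code: `unwrap_single` is a hypothetical helper, and unlike `[0]` it refuses lists that do not hold exactly one item instead of silently keeping only the first.

```python
import json

# A trimmed copy of the sample record from the question.
record = json.loads("""
{
  "id": "123",
  "contacts": [
    {"id": "contact_id_1", "contact_first_name": "John", "contact_last_name": "Doe"}
  ]
}
""")


def unwrap_single(items):
    """Return the only element of a one-item list; raise for any other length."""
    if len(items) != 1:
        raise ValueError(f"expected exactly 1 element, got {len(items)}")
    return items[0]


# The list level disappears; "contacts" is now the inner object itself.
record["contacts"] = unwrap_single(record["contacts"])
print(json.dumps(record, indent=2))
```

With two or more contacts in the list there is no single object to promote, which is why the one-element assumption matters.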

Full example:

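from pyspark.sql import SparkSession, functions as F

# Reuse the active session (e.g. in a notebook or the pyspark shell) or start one
spark = SparkSession.builder.getOrCreate()

# One sample row matching the schema in the question
df = spark.createDataFrame(
    [("123", "agreement", [("doc_id_1", "doc_name_1", ())], [("contact_id_1", "John", "Doe")])],
    "id string, description string, documents array<struct<id:string,doc_name:string,obligations:struct<>>>, contacts array<struct<id:string,contact_first_name:string,contact_last_name:string>>")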
df.printSchema()
# root
#  |-- id: string (nullable = true)
#  |-- description: string (nullable = true)
#  |-- documents: array (nullable = true)
#  |    |-- element: struct (containsNull = true)
#  |    |    |-- id: string (nullable = true)
#  |    |    |-- doc_name: string (nullable = true)
#  |    |    |-- obligations: struct (nullable = true)
#  |-- contacts: array (nullable = true)
#  |    |-- element: struct (containsNull = true)
#  |    |    |-- id: string (nullable = true)
#  |    |    |-- contact_first_name: string (nullable = true)
#  |    |    |-- contact_last_name: string (nullable = true)

# Replace each array column with its first element, which removes the array level
df = df.withColumn('documents', F.col('documents')[0])
df = df.withColumn('contacts', F.col('contacts')[0])

df.printSchema()
# root
#  |-- id: string (nullable = true)
#  |-- description: string (nullable = true)
#  |-- documents: struct (nullable = true)
#  |    |-- id: string (nullable = true)
#  |    |-- doc_name: string (nullable = true)
#  |    |-- obligations: struct (nullable = true)
#  |-- contacts: struct (nullable = true)
#  |    |-- id: string (nullable = true)
#  |    |-- contact_first_name: string (nullable = true)
#  |    |-- contact_last_name: string (nullable = true)

Upvotes: 1
