Vivian

Reputation: 105

Add an existing DataFrame column as a field in a struct column in PySpark

I have the following DataFrame:

 sku    category   price   state    infos_gerais
33344    mmmma     3.00     SP      [{5, 5656655, 5845454}]
33344    mmmma     3.00     MG      [{5, 6565767, 5854545}]
33344    mmmma     3.00     RS      [{5, 8788787, 4564646}]

Its schema is as follows:

|-- sku: string (nullable = true)
|-- category: string (nullable = true)
|-- price: double (nullable = true)
|-- state: string (nullable = true)
|-- infos_gerais: array (nullable = true)
|    |-- element: struct (containsNull = false)
|    |    |-- service_type_id: integer (nullable = true)
|    |    |-- cep_ini: integer (nullable = true)
|    |    |-- cep_fim: integer (nullable = true)

Note that the only column whose value differs between rows is 'state', so I need to insert it as a field into the structs of 'infos_gerais' and then apply a groupBy. I tried the code below, but it returns an error. Can anyone help me?

df_end = df_end.withColumn(
     "infos_gerais",
      sf.collect_list(
     sf.struct(
         sf.col("infos_gerais.*"),
         sf.col('infos_gerais.state').alias('state'))
     )
)

I need the following output:

sku    category   price      infos_gerais
33344    mmmma     3.00   [{5, 5656655, 5845454, SP}, {5, 6565767, 5854545, MG},{5, 8788787, 4564646, RS}]

Upvotes: 0

Views: 72

Answers (1)

samkart

Reputation: 6644

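Given that you have an array of structs, you can use transform to process each element of the array and withField on each struct to add or replace a struct field. After that, group by the remaining columns and flatten the collected arrays into a single array.

Here's a simple example: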

from pyspark.sql import functions as func

# transform (with a Python lambda) and withField both require Spark 3.1 or later
data_sdf. \
    withColumn('infos_gerais', 
               func.transform('infos_gerais', lambda x: x.withField('state', func.col('state')))
               ). \
    groupBy('sku', 'category', 'price'). \
    agg(func.flatten(func.collect_list('infos_gerais')).alias('infos_gerais')). \
    show(truncate=False)

# +-----+--------+-----+---------------------------------------------------------------------------------+
# |sku  |category|price|infos_gerais                                                                     |
# +-----+--------+-----+---------------------------------------------------------------------------------+
# |33344|mmmma   |3.0  |[{5, 5656655, 5845454, SP}, {5, 6565767, 5854545, MG}, {5, 8788787, 4564646, RS}]|
# +-----+--------+-----+---------------------------------------------------------------------------------+

# root
#  |-- sku: string (nullable = true)
#  |-- category: string (nullable = true)
#  |-- price: double (nullable = true)
#  |-- infos_gerais: array (nullable = true)
#  |    |-- element: struct (containsNull = true)
#  |    |    |-- service_type_id: integer (nullable = true)
#  |    |    |-- cep_ini: integer (nullable = true)
#  |    |    |-- cep_fim: integer (nullable = true)
#  |    |    |-- state: string (nullable = true)
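
To make the two steps easier to follow, here is a rough plain-Python sketch of what the pipeline above computes, with no Spark involved. Rows are modelled as dicts, and the helper names add_state and group_rows are hypothetical, made up for this illustration only; they are not part of PySpark.

```python
# Rows modelled as plain dicts; a stand-in for the DataFrame in the question.
rows = [
    {"sku": "33344", "category": "mmmma", "price": 3.0, "state": "SP",
     "infos_gerais": [{"service_type_id": 5, "cep_ini": 5656655, "cep_fim": 5845454}]},
    {"sku": "33344", "category": "mmmma", "price": 3.0, "state": "MG",
     "infos_gerais": [{"service_type_id": 5, "cep_ini": 6565767, "cep_fim": 5854545}]},
    {"sku": "33344", "category": "mmmma", "price": 3.0, "state": "RS",
     "infos_gerais": [{"service_type_id": 5, "cep_ini": 8788787, "cep_fim": 4564646}]},
]


def add_state(row):
    """Mimics transform + withField: copy the row's state into each struct."""
    return [{**info, "state": row["state"]} for info in row["infos_gerais"]]


def group_rows(rows, keys=("sku", "category", "price")):
    """Mimics groupBy + flatten(collect_list(...)): merge arrays per key."""
    grouped = {}
    for row in rows:
        key = tuple(row[k] for k in keys)
        # extend() concatenates the per-row arrays, which is what flatten does
        grouped.setdefault(key, []).extend(add_state(row))
    return [dict(zip(keys, key), infos_gerais=infos)
            for key, infos in grouped.items()]


result = group_rows(rows)
for record in result:
    print(record)
```

One difference worth keeping in mind: this sketch preserves the input order, whereas Spark's collect_list does not guarantee the order of the collected elements.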

Upvotes: 1
