nilesh1212

Reputation: 1655

Data type mismatch: cannot cast struct for Pyspark struct field cast

I have a dataframe with a column "hid_tagged" of struct type. I need to change the schema of that struct by prefixing each of its field names with "hid_tagged", as shown below. Following the steps below, I get a "data type mismatch: cannot cast struct" exception.

Can you let me know what I am missing here?

df2=df.select(col("hid_tagged").cast(transform_schema(df.schema)))

org.apache.spark.sql.AnalysisException: cannot resolve '`hid_tagged`' due to data type mismatch: cannot cast struct&

I am able to generate the expected struct schema using the UDF below:

UDF FOR SCHEMA CONVERSION:

from pyspark.sql.types import StructField, StructType

def transform_schema(schema):
  # Return an empty struct when no schema is given.
  if schema is None:
    return StructType()

  updated = []
  for f in schema.fields:
    if isinstance(f.dataType, StructType):
      # Keep the struct column's own name and recurse into its fields.
      updated.append(StructField(f.name, transform_schema(f.dataType), f.nullable))
    else:
      # Prefix leaf field names, e.g. field_1 -> hid_tagged_field_1.
      updated.append(StructField("hid_tagged_" + f.name, f.dataType, f.nullable))

  return StructType(updated)

Source struct schema:

hid_tagged:struct
    field_1:long
    field_2:long
    field_3:string
    field_4:array
        element:string
    field_5:string
    field_6:string
    field_7:long
    field_8:long
    field_9:long
    field_10:boolean
    field_11:string
    field_12:long
    field_13:long
    field_14:long
    field_15:long
    field_16:long
    field_17:long
    field_18:long
    field_19:long
    field_20:long

Expected Struct schema:

hid_tagged:struct
    hid_tagged_field_1:long
    hid_tagged_field_2:long
    hid_tagged_field_3:string
    hid_tagged_field_4:array
        element:string
    hid_tagged_field_5:string
    hid_tagged_field_6:string
    hid_tagged_field_7:long
    hid_tagged_field_8:long
    hid_tagged_field_9:long
    hid_tagged_field_10:boolean
    hid_tagged_field_11:string
    hid_tagged_field_12:long
    hid_tagged_field_13:long
    hid_tagged_field_14:long
    hid_tagged_field_15:long
    hid_tagged_field_16:long
    hid_tagged_field_17:long
    hid_tagged_field_18:long
    hid_tagged_field_19:long
    hid_tagged_field_20:long

Upvotes: 0

Views: 5387

Answers (1)

mck

Reputation: 42352

Try this:

df2 = df.select(col("hid_tagged").cast(transform_schema(df.schema)['hid_tagged'].dataType))

transform_schema(df.schema) returns the transformed schema for the whole dataframe, not for the hid_tagged column alone. You need to pick out the data type of the hid_tagged field from that schema before casting.

Upvotes: 1
