Reputation: 46
I'm having trouble with JSON conversion in PySpark when working with complex nested-struct columns. The schema I pass to from_json doesn't seem to behave. Example:
import pyspark.sql.functions as f
df = spark.createDataFrame([[1, 'a'], [2, 'b'], [3, 'c']], ['rownum', 'rowchar']) \
    .withColumn('struct', f.expr("transform(array(1,2,3), i -> named_struct('a1',rownum*i,'a2',rownum*i*2))"))
df.display()
df.withColumn('struct',f.to_json('struct')).withColumn('struct',f.from_json('struct',df.schema['struct'])).display()
df.withColumn('struct',f.to_json('struct')).withColumn('struct',f.from_json('struct',df.select('struct').schema)).display()
This fails with:
Cannot parse the schema in JSON format: Failed to convert the JSON string (big JSON string) to a data type
I'm not sure whether this is a syntax error on my end, an edge case that's failing, the wrong way to do things, or something else.
Upvotes: 0
Views: 1495
Reputation: 32640
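You're not passing the correct schema to from_json. df.schema['struct'] returns a StructField (name, type, nullability, and metadata), whereas from_json expects the column's data type, which is stored in the field's dataType attribute. Try this instead: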
df.withColumn('struct', f.to_json('struct')) \
    .withColumn('struct', f.from_json('struct', df.schema["struct"].dataType)) \
    .display()
Upvotes: 1