My output_schema for a Pandas UDF contains the following fields:
Out[183]: [StructField(id,StringType,true),
StructField(2018-01-01,StructType(List(StructField(real,FloatType,true),StructField(imag,FloatType,true))),true),
StructField(2018-01-02,StructType(List(StructField(real,FloatType,true),StructField(imag,FloatType,true))),true),
StructField(2018-01-03,StructType(List(StructField(real,FloatType,true),StructField(imag,FloatType,true))),true),
StructField(2018-01-04,StructType(List(StructField(real,FloatType,true),StructField(imag,FloatType,true))),true),
StructField(2018-01-05,StructType(List(StructField(real,FloatType,true),StructField(imag,FloatType,true))),true),
StructField(2018-01-06,StructType(List(StructField(real,FloatType,true),StructField(imag,FloatType,true))),true),
StructField(2018-01-07,StructType(List(StructField(real,FloatType,true),StructField(imag,FloatType,true))),true),
StructField(2018-01-08,StructType(List(StructField(real,FloatType,true),StructField(imag,FloatType,true))),true),
...
and is of type:
Out[185]: pyspark.sql.types.StructType
What I'm trying to output is a column with an id, while the rest of the columns are tuples that hold two floats. My code for defining the schema is below; it defines a StructType() tuple for every column that isn't the id.
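import json
from pyspark.sql.types import FloatType, StructField, StructType

# skeleton_schema is an existing StructType: the id column plus one column per date
fields = []
for f in json.loads(skeleton_schema.json())["fields"]:
    if f["name"] != "id":
        # Every non-id column becomes a struct of two floats (real, imag)
        fields.append(StructField(f["name"], StructType([
            StructField("real", FloatType(), True),
            StructField("imag", FloatType(), True)
        ]), True))
    else:
        # Keep the id column exactly as the skeleton schema defines it
        fields.append(StructField.fromJson(f))
output_schema = StructType(fields)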
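For anyone who wants to reproduce the schema shape without a Spark session, here is a minimal sketch of the same transformation done on the plain JSON dict. It assumes the skeleton has the layout that `StructType.jsonValue()` produces (`name`, `type`, `nullable`, `metadata` per field); the helper names `complex_pair` and `nest_complex_fields` and the sample skeleton are made up for illustration.

```python
import json


def complex_pair() -> dict:
    """Build a fresh struct-of-two-floats description (real, imag)."""
    return {
        "type": "struct",
        "fields": [
            {"name": "real", "type": "float", "nullable": True, "metadata": {}},
            {"name": "imag", "type": "float", "nullable": True, "metadata": {}},
        ],
    }


def nest_complex_fields(skeleton: dict, id_name: str = "id") -> dict:
    """Replace the type of every non-id field with a (real, imag) struct."""
    fields = []
    for f in skeleton["fields"]:
        if f["name"] == id_name:
            # The id column is passed through unchanged.
            fields.append(f)
        else:
            fields.append({
                "name": f["name"],
                "type": complex_pair(),
                "nullable": True,
                "metadata": {},
            })
    return {"type": "struct", "fields": fields}


# Hypothetical skeleton with an id and two date columns.
skeleton = {
    "type": "struct",
    "fields": [
        {"name": "id", "type": "string", "nullable": True, "metadata": {}},
        {"name": "2018-01-01", "type": "float", "nullable": True, "metadata": {}},
        {"name": "2018-01-02", "type": "float", "nullable": True, "metadata": {}},
    ],
}

nested = nest_complex_fields(skeleton)
print(json.dumps(nested, indent=2))
```

If PySpark is installed, passing `nested` to `StructType.fromJson` should give a schema equivalent to the `output_schema` built above.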
However, when running my UDF I receive a NotImplementedError, and the output prints my entire schema and says it's not supported. What exactly isn't supported, and what am I doing wrong?