Reputation: 95
I have pandas udf defined below
from pyspark.sql.types import StructType, StructField, IntegerType, DoubleType
from pyspark.sql.functions import pandas_udf, PandasUDFType
import pandas as pd

schema2 = StructType([
    StructField('sensorid', IntegerType(), True),
    StructField('confidence', DoubleType(), True),
])

@pandas_udf(schema2, PandasUDFType.GROUPED_MAP)
def PreProcess(Indf):
    # One output row per group: the group's sensorid plus a confidence value.
    sensor = Indf.iloc[0, 0]
    return pd.DataFrame({'sensorid': [sensor], 'confidence': [0.0]})
I am then passing a Spark DataFrame with 3 columns to that UDF:
results.groupby("sensorid").apply(PreProcess)
results:
+--------+---------------+---------------+
|sensorid|sensortimestamp|calculatedvalue|
+--------+---------------+---------------+
| 397332| 1596518086| -39.0|
| 397332| 1596525586| -31.0|
But I keep getting this error:
RuntimeError: Number of columns of the returned pandas.DataFrame doesn't match specified schema.Expected: 3 Actual: 4
I can tell what the error is trying to say, but I don't understand how it can occur: I thought I was returning exactly the 2 columns specified in the struct.
Upvotes: 7
Views: 2499
Reputation: 2347
apply
is deprecated, and it seems to expect the returned DataFrame to have the same columns as the input (3 in this case). Use applyInPandas
with the expected output schema instead:
results.groupby("sensorid").applyInPandas(PreProcess, schema=schema2)
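Since applyInPandas calls the function once per group with a plain pandas DataFrame, the function's contract can be checked locally with pandas alone before handing it to Spark. A minimal sketch, assuming pandas is installed and using the column names and sample rows from the question (the plain function pre_process stands in for the UDF body):

```python
import pandas as pd

def pre_process(in_df: pd.DataFrame) -> pd.DataFrame:
    # Return exactly the two columns declared in schema2:
    # one row per group with the group's sensorid and a confidence value.
    sensor = in_df.iloc[0, 0]
    return pd.DataFrame({'sensorid': [sensor], 'confidence': [0.0]})

# Local stand-in for the 3-column Spark DataFrame from the question.
results = pd.DataFrame({
    'sensorid': [397332, 397332],
    'sensortimestamp': [1596518086, 1596525586],
    'calculatedvalue': [-39.0, -31.0],
})

# Mimics the per-group call that applyInPandas performs on executors.
out = results.groupby('sensorid', group_keys=False).apply(pre_process)
print(out)
```

If this local run returns a frame whose columns match the declared schema, the same function body should satisfy applyInPandas on the cluster.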
Links updated to the latest docs (Spark's documentation changed and the old links were broken).
In version 3.0.0: apply
applyInPandas
Upvotes: 1