Reputation: 1
I am trying to implement a voting classifier in PySpark.
I used the function predict_from_multiple_estimator. The arguments passed to the function are estimators1, which are trained and fitted pipeline models in PySpark; X, the test DataFrame; the possible class labels; and the weight values.
I then tried to convert this function into a PySpark UDF and called it on the features column of the test DataFrame qa to predict the class label.
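To make the intended computation concrete, here is a minimal sketch of weighted soft voting for a single sample, written in plain Python and independent of Spark. The helper name soft_vote and the probability values are made up for illustration; they are not part of my actual code.

```python
def soft_vote(probas, weights, labels):
    """Weighted soft voting for one sample.

    probas  -- one list of per-class probabilities per estimator
    weights -- one weight per estimator
    labels  -- class labels, in the same order as the probability entries
    """
    total = sum(weights)
    n_classes = len(labels)
    # Weighted average of each class probability across the estimators.
    averaged = [
        sum(w * p[c] for w, p in zip(weights, probas)) / total
        for c in range(n_classes)
    ]
    # Pick the class with the highest averaged probability.
    best = max(range(n_classes), key=averaged.__getitem__)
    return labels[best], averaged


# Hypothetical probabilities from two estimators for one sample.
label, averaged = soft_vote(
    probas=[[0.2, 0.5, 0.3], [0.1, 0.3, 0.6]],
    weights=[1, 1],
    labels=[0, 1, 2],
)
```

With equal weights this is a plain average of the two probability vectors, followed by an argmax over the classes.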
estimators1 = [S1, S2]
# where S1 and S2 are Spark pipeline models (featurizer, PCA, logistic regression and naive Bayes)
w = [1,1]
label_encoder = [0, 1, 2]
def predictestimator(X, label_encoder, estimators=estimators1, weights=w):
    # Predict 'soft' voting with probabilities
    p1 = np.asarray([clf.predict_proba(X) for clf, X in zip(estimators, X_list)])
    p2 = np.average(p1, axis=0, weights=weights)
    p = np.argmax(p2, axis=1)
    # Convert integer predictions to original labels:
    return label_encoder.inverse_transform(p)
from pyspark.sql.functions import udf
udf1 = udf(predictestimator)
qa = featurizer.transform(test)  # qa is a PySpark DataFrame holding the image features
# qa is DataFrame[image: struct<origin:string,height:int,width:int,nChannels:int,mode:int,data:binary>, features: vector]
qa.withColumn("predictedlabel", udf1("features")).show()  # running this statement raises the error
Error I'm getting:
PicklingError: Could not serialize object: TypeError: can't pickle dict_keys objects
Upvotes: 0
Views: 344
Reputation: 5125
I found out why this won't work. Python 3 changed how dict keys work: dict.keys() now returns a dict_keys view object instead of a list, and view objects cannot be pickled. Check out this very good explanation.
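As a minimal sketch of that behavior outside Spark (the dictionary and variable names here are made up for illustration, not taken from the question's code), pickling a dict_keys view fails, while pickling a list copy of the same keys works:

```python
import pickle

config = {"a": 1, "b": 2}   # hypothetical dictionary
keys_view = config.keys()   # a dict_keys view in Python 3, not a list

# Pickling the view raises TypeError; the exact message wording
# differs between Python versions.
try:
    pickle.dumps(keys_view)
    view_pickled = True
except TypeError as exc:
    view_pickled = False
    error_message = str(exc)

# Copying the keys into a list gives an ordinary picklable object.
keys_list = list(keys_view)
restored = pickle.loads(pickle.dumps(keys_list))
```

So if some object captured by the UDF holds a dict_keys view, converting that view to a list before the function is serialized may avoid this particular error. Whether that is enough in the question's case depends on what else the UDF closure references.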
Upvotes: 0