Anu

Reputation: 1

Voting classifier UDF in pyspark

I am trying to implement a voting classifier in pyspark.

I used the function predict_from_multiple_estimator. Its arguments are estimators1 (a list of trained, fitted pipeline models in PySpark), X (the test dataframe), the possible class labels, and the weight values.
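For reference, the soft-voting step itself can be sketched in plain Python, independent of Spark. This is only an illustration of the idea: the name soft_vote, the input layout (one list of per-sample probability rows per classifier), and the sample probabilities are all made up, not part of the code in this question.

```python
def soft_vote(probabilities, weights, labels):
    """Pick one label per sample from weighted-average class probabilities.

    probabilities: one entry per classifier; each entry is a list of
                   per-sample rows of class probabilities (assumed layout).
    weights:       one weight per classifier.
    labels:        class labels, in the same order as the probability columns.
    """
    total = sum(weights)
    n_samples = len(probabilities[0])
    n_classes = len(labels)
    predictions = []
    for i in range(n_samples):
        # Weighted average of each class probability across classifiers.
        averaged = [
            sum(w * clf[i][c] for w, clf in zip(weights, probabilities)) / total
            for c in range(n_classes)
        ]
        # Index of the class with the highest averaged probability.
        best = max(range(n_classes), key=lambda c: averaged[c])
        predictions.append(labels[best])
    return predictions


# Made-up probabilities from two classifiers, two samples, three classes.
clf_a = [[0.6, 0.3, 0.1], [0.2, 0.3, 0.5]]
clf_b = [[0.2, 0.7, 0.1], [0.1, 0.2, 0.7]]

result = soft_vote([clf_a, clf_b], weights=[1, 1], labels=[0, 1, 2])
print(result)
```

With equal weights the first sample averages to roughly [0.4, 0.5, 0.1] and the second to roughly [0.15, 0.25, 0.6], so the predicted labels are 1 and 2.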

Then I tried to convert this function into a PySpark UDF and called it on the features column of the test dataframe qa to predict the class label.

import numpy as np

# S1 and S2 are fitted Spark pipeline models: pipeline(featurizer, PCA, logistic regression and naive Bayes)
estimators1 = [S1, S2]

w = [1, 1]

label_encoder = [0, 1, 2]

def predictestimator(X, label_encoder, estimators=estimators1, weights=w):
    # Predict 'soft' voting with probabilities
    p1 = np.asarray([clf.predict_proba(X) for clf, X in zip(estimators, X_list)])
    p2 = np.average(p1, axis=0, weights=weights)
    p = np.argmax(p2, axis=1)

    # Convert integer predictions to original labels:
    return label_encoder.inverse_transform(p)

from pyspark.sql.functions import udf

udf1 = udf(predictestimator)

# qa is a PySpark dataframe that holds the image features:
# DataFrame[image: struct<origin:string,height:int,width:int,nChannels:int,mode:int,data:binary>, features: vector]
qa = featurizer.transform(test)

# Running this statement produces the error
qa.withColumn("predictedlabel", udf1("features")).show()

Error I'm getting:

PicklingError: Could not serialize object: TypeError: can't pickle dict_keys objects

Upvotes: 0

Views: 344

Answers (1)

Matt Andruff

Reputation: 5125

I found out why this won't work. Python 3 changed how dict_keys behave: dict.keys() now returns a view object instead of a list, and that view cannot be pickled. Check out this very good explanation.
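A minimal sketch of that behaviour, using only the standard pickle module. It is not taken from the linked explanation, and the helper can_pickle and the params dictionary are made up for illustration. Spark pickles a UDF together with everything it references before sending it to the executors, so any dict_keys view reachable from the function triggers the same TypeError; which object holds the view in the question's code is not established here.

```python
import pickle


def can_pickle(obj):
    """Return True if pickle can serialize obj, False if it raises."""
    try:
        pickle.dumps(obj)
        return True
    except (TypeError, pickle.PicklingError):
        return False


# Hypothetical dictionary standing in for any object that exposes dict keys.
params = {"maxIter": 10, "regParam": 0.1}

view_ok = can_pickle(params.keys())        # dict_keys view object
list_ok = can_pickle(list(params.keys()))  # plain list copy of the keys

print(view_ok, list_ok)  # False True
```

Copying the keys into a list gives pickle an ordinary object to serialize, which is the usual workaround when the dictionary is under your control.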

Upvotes: 0
