Vusal

Reputation: 33

Xgboost on Spark Validation Indicator Column and Evaluation Metric

I am using the xgboost PySpark API. The API is experimental, but it supports most of the features of the regular xgboost API.

As per the documentation below, the eval_set parameter is not supported; the validationIndicatorCol parameter should be used instead (a sketch of creating that column precedes the classifier code below).

  1. https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.spark

  2. https://databricks.github.io/spark-deep-learning/#module-sparkdl.xgboost
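
For illustration, a minimal way to create the isVal indicator is to randomly tag a fraction of rows. The 20% fraction and the seed here are assumptions; sampled_df and isVal match the names used in the snippet that follows:

    from pyspark.sql import functions as F

    # Randomly mark ~20% of rows as validation rows; rows where isVal is
    # True are used as the evaluation set via validationIndicatorCol.
    sampled_df = sampled_df.withColumn("isVal", F.rand(seed=1) < 0.2)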

    from sparkdl.xgboost import XgboostClassifier
    from pyspark.ml import Pipeline

    xgb = XgboostClassifier(featuresCol="features",
                            labelCol="label",
                            num_workers=40,
                            random_state=1,
                            missing=None,
                            objective='binary:logistic',
                            validationIndicatorCol='isVal',
                            eval_metric='aucpr',
                            n_estimators=best_n_estimators,
                            max_depth=best_max_depth,
                            learning_rate=best_learning_rate)

    pipeline = Pipeline(stages=[vectorAssembler, xgb])
    pipelineModel = pipeline.fit(sampled_df)

It seems to be running without any errors, which is great.

How do you print and inspect the evaluation results? Traditional xgboost has an evals_result() method, but pipelineModel.stages[-1].evals_result() doesn't seem to work in the PySpark API, even though the documentation doesn't say it is unsupported. Any idea on how to make it work?
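
For reference, one workaround is to recompute the metric yourself by scoring the validation rows with Spark's built-in evaluator. This is only a sketch that reproduces the final aucpr number, not the per-iteration history that evals_result() would give:

    from pyspark.ml.evaluation import BinaryClassificationEvaluator

    # Score only the rows flagged as validation data, then compute
    # the PR AUC to match eval_metric='aucpr' above.
    val_preds = pipelineModel.transform(sampled_df.filter("isVal"))
    evaluator = BinaryClassificationEvaluator(labelCol="label",
                                              metricName="areaUnderPR")
    print(evaluator.evaluate(val_preds))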

Upvotes: 0

Views: 878

Answers (1)

Shazna

Reputation: 1

Assuming you need to see the attributes recorded at the best iteration, this worked for me:

    xgb_model = model.stages[-1]  # last pipeline stage: the fitted xgboost model
    xgb_model.get_booster().attributes()  # returns the booster attributes recorded at the best iteration
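
Note that attributes() returns its values as strings. A quick follow-up sketch, assuming early stopping was enabled so that the best_iteration and best_score attributes were actually set:

    attrs = xgb_model.get_booster().attributes()
    # Both keys are set by xgboost only when early stopping runs.
    print(attrs.get("best_iteration"), attrs.get("best_score"))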

Upvotes: 0
