Reputation: 1
I am building a machine learning model (random forest) in Spark (PySpark) with cross-validation and grid search. I have two DataFrames, one for training and one for testing, both stored in Parquet.
After running the entire training, validation, and testing pipeline, I found that the experiment is not reproducible: even with a fixed value for 'seed' in every function that accepts this parameter, I cannot obtain exactly the same result when I create a new Spark session and re-run the pipeline.
Repeating this test with different datasets, I found that in some cases some executions produce the same confusion matrix, but the scores are not the same, nor are the cross-validation results (AUC-ROC). The code is always the same, with the same Spark configuration. This happens with decision trees, random forests, and gradient boosting. The models were evaluated with precision, recall, F1-score, accuracy, AUC-ROC, and AUC-PR, and the results were non-deterministic.
My question is: how can I ensure that running the same model pipeline two or more times produces exactly the same result? If that is not possible, why not?
I set a fixed seed in every function that accepts one and repeated the execution dozens of times, always with the same data, the same settings, and the same parameter grid.
Here is a snippet of the experiment:
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
# Spark session created here (omitted)
# training_data and test_data read from Parquet (omitted)

# Random forest with a fixed seed
random_forest = RandomForestClassifier(seed=42, labelCol="label", featuresCol="features")

# Hyperparameter grid to search
param_grid = (ParamGridBuilder()
              .addGrid(random_forest.numTrees, [10, 50])
              .addGrid(random_forest.maxDepth, [5, 10])
              .build())

evaluator = BinaryClassificationEvaluator(metricName="areaUnderROC", labelCol="label")

# 5-fold cross-validation, also with a fixed seed
cross_validator = CrossValidator(estimator=random_forest,
                                 estimatorParamMaps=param_grid,
                                 evaluator=evaluator,
                                 numFolds=5,
                                 seed=42)

cv_model = cross_validator.fit(training_data)
best_model = cv_model.bestModel

# Score the held-out test set with the best model
predictions = best_model.transform(test_data)
predictions.select("features", "label", "probability", "prediction").show()

auc_roc = evaluator.evaluate(predictions)
print("AUC-ROC:", auc_roc)
Upvotes: 0
Views: 15