Saved Random Forest model produces different results on the same dataset

I'm having trouble reproducing results with a Random Forest model that was saved to disk, even though I use exactly the same dataset for prediction. In other words, I train a model on dataset A and persist it on my local machine; then I load it and use it to predict dataset B. Every time I predict dataset B, I get different results.

I'm aware of the randomness involved in a Random Forest classifier. However, as far as I understand, this randomness only matters during training: once the model has been created, the predictions shouldn't change if the same data is used for prediction.

The training script has the following structure:

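from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import OneHotEncoderEstimator, StringIndexer, VectorAssembler
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# `spark` is an existing SparkSession; `train_csv_path` is a placeholder
# for the path to the training CSV
df_train = spark.read.format("csv") \
      .option('header', 'true') \
      .option('inferSchema', 'true') \
      .option('delimiter', ';') \
      .load(train_csv_path)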

#The problem seems to be related to the StringIndexer/One-Hot Encoding
#If I remove all categorical variables the results can be reproduced
categorical_variables = []
for variable in df_train.dtypes:
    # dtypes yields (column_name, type_name) pairs
    if variable[1] == 'string':
        categorical_variables.append(variable[0])

indexers = [StringIndexer(inputCol=col, outputCol=col+"_indexed") for col in categorical_variables]

for indexer in indexers:
    df_train = indexer.fit(df_train).transform(df_train)
    df_train = df_train.drop(indexer.getInputCol())

indexed_cols = []
for variable in df_train.columns:
    if variable.endswith("_indexed"):
        indexed_cols.append(variable)

encoders = []
for variable in indexed_cols:
    inputCol = variable
    outputCol = variable.replace("_indexed", "_encoded")
    one_hot_encoder_estimator_train = OneHotEncoderEstimator(inputCols=[inputCol], outputCols=[outputCol])

    encoder_model_train = one_hot_encoder_estimator_train.fit(df_train)
    df_train = encoder_model_train.transform(df_train)
    df_train = df_train.drop(inputCol)

inputCols = [x for x in df_train.columns if x != "id" and x != "churn"]

vector_assembler_train = VectorAssembler(inputCols=inputCols, outputCol="features")

df_train = vector_assembler_train.transform(df_train)

df_train = df_train.select('churn', 'features', 'id')

df_train_1 = df_train.filter(df_train['churn'] == 0).sample(withReplacement=False, fraction=0.3, seed=7)
df_train_2 = df_train.filter(df_train['churn'] == 1).sample(withReplacement=True, fraction=20.0, seed=7)
df_train = df_train_1.unionAll(df_train_2) 

rf = RandomForestClassifier(labelCol="churn", featuresCol="features")
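paramGrid = ParamGridBuilder() \
    .addGrid(rf.numTrees, [100]) \
    .addGrid(rf.maxDepth, [15]) \
    .addGrid(rf.maxBins, [32]) \
    .addGrid(rf.featureSubsetStrategy, ['onethird']) \
    .addGrid(rf.subsamplingRate, [1.0]) \
    .addGrid(rf.minInfoGain, [0.0]) \
    .addGrid(rf.impurity, ['gini']) \
    .addGrid(rf.minInstancesPerNode, [1]) \
    .addGrid(rf.seed, [10]) \
    .build()

# areaUnderROC is the evaluator's default metric
evaluator = BinaryClassificationEvaluator(labelCol="churn")

# Arguments after estimator=rf are reconstructed from the objects defined
# above; numFolds is left at its default
crossval = CrossValidator(estimator=rf,
                          estimatorParamMaps=paramGrid,
                          evaluator=evaluator)
model = crossval.fit(df_train)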

The testing script is as follows:

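from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import OneHotEncoderEstimator, StringIndexer, VectorAssembler
from pyspark.ml.tuning import CrossValidatorModel

# `spark` is an existing SparkSession; `test_csv_path` is a placeholder
# for the path to the test CSV
df_test = spark.read.format("csv") \
      .option('header', 'true') \
      .option('inferSchema', 'true') \
      .option('delimiter', ';') \
      .load(test_csv_path)
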
#The problem seems to be related to the StringIndexer/One-Hot Encoding
#If I remove all categorical variables the results can be reproduced
categorical_variables = []
for variable in df_test.dtypes:
    # dtypes yields (column_name, type_name) pairs
    if variable[1] == 'string':
        categorical_variables.append(variable[0])

indexers = [StringIndexer(inputCol=col, outputCol=col+"_indexed") for col in categorical_variables]

for indexer in indexers:
    df_test = indexer.fit(df_test).transform(df_test)
    df_test = df_test.drop(indexer.getInputCol())

indexed_cols = []
for variable in df_test.columns:
    if variable.endswith("_indexed"):
        indexed_cols.append(variable)

encoders = []
for variable in indexed_cols:
    inputCol = variable
    outputCol = variable.replace("_indexed", "_encoded")
    one_hot_encoder_estimator_test = OneHotEncoderEstimator(inputCols=[inputCol], outputCols=[outputCol])

    encoder_model_test= one_hot_encoder_estimator_test.fit(df_test)
    df_test= encoder_model_test.transform(df_test)
    df_test= df_test.drop(inputCol)

inputCols = [x for x in df_test.columns if x != "id" and x != "churn"]

vector_assembler_test = VectorAssembler(inputCols=inputCols, outputCol="features")

df_test = vector_assembler_test.transform(df_test)

df_test = df_test.select('churn', 'features', 'id')

model = CrossValidatorModel.load("C:/myModel")

result = model.transform(df_test)

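# The evaluator has to be defined in this script too; areaUnderROC is its default metric
evaluator = BinaryClassificationEvaluator(labelCol="churn")
areaUnderROC = evaluator.evaluate(result)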

tp = result.filter("prediction == 1.0 AND churn == 1").count()
tn = result.filter("prediction == 0.0 AND churn == 0").count()
fp = result.filter("prediction == 1.0 AND churn == 0").count()
fn = result.filter("prediction == 0.0 AND churn == 1").count()

Every time I run the testing script, the AUC and the confusion matrix come out different. I use Spark 2.4.5 and Python 3.7 on a Windows 10 machine. Any suggestions or ideas are very much appreciated.

Edit: The problem is related to the StringIndexer/One-Hot Encoding steps. When I use only numerical variables, I'm able to reproduce the results. The question is still open, since I can't explain why this happens.

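In my experience, this issue arises because you are re-fitting the encoding steps (the OneHotEncoder, and the StringIndexer that feeds it) on the test data instead of reusing the ones fitted during training.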

Here is how OneHotEncoding works, from the docs:

A one-hot encoder that maps a column of category indices to a column of binary vectors, with at most a single one-value per row that indicates the input category index. For example with 5 categories, an input value of 2.0 would map to an output vector of [0.0, 0.0, 1.0, 0.0]. The last category is not included by default (configurable via dropLast), because it makes the vector entries sum up to one, and hence linearly dependent. So an input value of 4.0 maps to [0.0, 0.0, 0.0, 0.0].
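
To make the quoted behaviour concrete, here is a minimal pure-Python sketch of the mapping the docs describe. It is only an illustration, not Spark's implementation: the function name `one_hot` and its parameters are made up for this example, and Spark returns sparse vectors rather than Python lists.

```python
def one_hot(index, num_categories, drop_last=True):
    """Map a category index to a one-hot list (illustrative helper, not a Spark API)."""
    if not 0 <= index < num_categories:
        raise ValueError("index out of range for the fitted number of categories")
    # With drop_last, the last category is represented by an all-zero vector
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    if index < size:
        vec[int(index)] = 1.0
    return vec


print(one_hot(2.0, 5))  # [0.0, 0.0, 1.0, 0.0]
print(one_hot(4.0, 5))  # [0.0, 0.0, 0.0, 0.0]
```

Note that the output depends on both the index and the number of categories, and both are determined when the indexer and encoder are fitted.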

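Therefore, whenever the encoding steps are fitted on different data (which is naturally the case for train vs. test), the vector produced by the One Hot Encoder for the same category can be different, and the model receives features that no longer line up with what it was trained on.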
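
The following sketch shows why re-fitting matters. It imitates the default ordering of StringIndexer (`frequencyDesc`, most frequent label gets index 0) in plain Python; the helper name and the category labels are hypothetical. Ties are broken alphabetically here only to keep the sketch deterministic. If I recall correctly, Spark releases before 3.0 do not guarantee an order among equally frequent labels, which could be one reason why repeated runs on the same data differ; treat that as an assumption to verify.

```python
from collections import Counter


def fit_string_index(values):
    """Assign indices by descending frequency (illustrative stand-in for StringIndexer)."""
    counts = Counter(values)
    # Most frequent label first; alphabetical tie-break is a choice made for this sketch
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


# Hypothetical category column in the training and test data
train_values = ["basic", "basic", "basic", "premium", "premium", "gold"]
test_values = ["gold", "gold", "gold", "basic", "basic", "premium"]

train_index = fit_string_index(train_values)
test_index = fit_string_index(test_values)

print(train_index)  # {'basic': 0.0, 'premium': 1.0, 'gold': 2.0}
print(test_index)   # {'gold': 0.0, 'basic': 1.0, 'premium': 2.0}
```

The label "gold" gets index 2.0 when fitted on the training data but 0.0 when fitted on the test data, so its one-hot vector changes as well.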

You should put the OneHotEncoder into a pipeline together with your model, fit the pipeline on the training data and save it, then load the fitted pipeline in the test script. This way, each category is guaranteed to map to the same encoded values every time data is run through the pipeline.
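
A rough sketch of that approach, under a few assumptions: the function names, the `numeric_cols` argument and the save path are hypothetical, the StringIndexers are included in the pipeline as well, and cross-validation is left out for brevity. The import fallback is there because `OneHotEncoderEstimator` (Spark 2.3/2.4) was renamed to `OneHotEncoder` in Spark 3.0.

```python
def stage_column_names(categorical_cols):
    """Return the (indexed, encoded) column names used by the pipeline stages."""
    indexed = [c + "_indexed" for c in categorical_cols]
    encoded = [c + "_encoded" for c in categorical_cols]
    return indexed, encoded


def build_pipeline(categorical_cols, numeric_cols, label_col="churn"):
    """Build an unfitted pipeline: index -> one-hot encode -> assemble -> random forest."""
    # pyspark is imported here so the helper above can be used without Spark installed
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import RandomForestClassifier
    from pyspark.ml.feature import StringIndexer, VectorAssembler
    try:
        from pyspark.ml.feature import OneHotEncoderEstimator  # Spark 2.3 / 2.4
    except ImportError:
        from pyspark.ml.feature import OneHotEncoder as OneHotEncoderEstimator  # Spark 3.x

    indexed, encoded = stage_column_names(categorical_cols)
    indexers = [StringIndexer(inputCol=c, outputCol=i)
                for c, i in zip(categorical_cols, indexed)]
    encoder = OneHotEncoderEstimator(inputCols=indexed, outputCols=encoded)
    assembler = VectorAssembler(inputCols=numeric_cols + encoded, outputCol="features")
    rf = RandomForestClassifier(labelCol=label_col, featuresCol="features", seed=10)
    return Pipeline(stages=indexers + [encoder, assembler, rf])


def train_and_save(df_train, categorical_cols, numeric_cols, path):
    """Fit every stage on the training data only, then persist the fitted pipeline."""
    model = build_pipeline(categorical_cols, numeric_cols).fit(df_train)
    model.write().overwrite().save(path)  # e.g. "C:/myPipelineModel" (hypothetical path)
    return model


def load_and_predict(df_test, path):
    """Load the fitted pipeline and apply it to raw test data; nothing is re-fitted."""
    from pyspark.ml import PipelineModel
    return PipelineModel.load(path).transform(df_test)
```

With this layout the test script passes the raw DataFrame straight to `load_and_predict`, so the index and encoding mappings learned from the training data are reused as-is. Any resampling of the training data would then have to happen on the raw DataFrame before calling `fit`.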

More details on saving and loading pipelines can be found in the documentation.

