Reputation: 315
I am very new to PySpark and need to perform prediction. I've already done everything in Python, but the data I have to apply the logic to is huge, so I need to translate everything to PySpark.
The problem is: I have 2 dataframes. The first one is for training and contains the Y column; the second one is for testing (loaded from my S3 bucket).
I already cleaned and prepared the training dataframe in Python and converted it to a PySpark df:
from pyspark.sql import SparkSession
import pandas as pd

# Create a SparkSession
spark = SparkSession.builder.appName('myapp').getOrCreate()

# Load the pandas DataFrame
pdf = dataframe  # dataframe is the name of my pandas df

# Convert the pandas DataFrame to a PySpark DataFrame
df_spark = spark.createDataFrame(pdf)
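(Side note: for a large pandas frame, enabling Arrow can speed up createDataFrame considerably. A minimal sketch, assuming Spark 3.x, where the config key is spark.sql.execution.arrow.pyspark.enabled:)

# Optional: Arrow-accelerated pandas -> Spark conversion (Spark 3.x config key)
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")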
I selected the training columns:
y = df_spark.select("Y")
X = df_spark.select([c for c in df_spark.columns if c != "Y"])
training_columns = X.columns
I followed a tutorial that used VectorAssembler:
from pyspark.ml.feature import VectorAssembler

# Combine all training columns into a single 'features' vector column
assembler = VectorAssembler(inputCols=training_columns, outputCol='features')
output = assembler.transform(df_spark)
final_data = output.select('features', 'Y')
I trained the model (Random Forest):
from pyspark.ml.classification import RandomForestClassifier

# Set the hyperparameters
n_estimators = 100
max_depth = 5

# Create the model
rfc = RandomForestClassifier(
    featuresCol='features',
    labelCol='Y',
    numTrees=n_estimators,
    maxDepth=max_depth
)

# Fit the model on the training data
model = rfc.fit(final_data)
I checked the model evaluation on the training data:
predictions = model.transform(final_data)

from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Default metric is areaUnderROC
my_binary_eval = BinaryClassificationEvaluator(labelCol='Y')
print(my_binary_eval.evaluate(predictions))
Now comes the moment where I want to apply the model to a different PySpark dataframe: the test data, with a lot of records, called df_to_predict.
df_to_predict might have a slightly different set of columns compared to the training data, since I dropped columns with no variance during training. Hence I first need to restrict it to the columns from the training set:
df_to_predict = df_to_predict.select(training_columns)
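(If a training column could ever be missing from df_to_predict, this select would fail, so a quick check first is a cheap safeguard; a minimal sketch:)

# Fail early with a clear message if the test data lacks any training column
missing = set(training_columns) - set(df_to_predict.columns)
assert not missing, f"df_to_predict is missing training columns: {missing}"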
Next I would like to apply the model and do the equivalent of predict_proba (that one is from the sklearn package); I do not know how to convert this code to PySpark:
df_to_predict["Predicted_Y_Probability"] = model.predict_proba(df_to_predict)[::, 1]
Upvotes: 0
Views: 1018
Reputation: 1256
You can use a UDF to wrap your custom function. Check this answer:
Unable to make prediction with Sklearn model on pyspark dataframe
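In your case, though, the trained model is a Spark ML RandomForestClassifier rather than an sklearn model, so a UDF isn't strictly required: model.transform already adds a probability vector column, and you can extract the positive-class probability from it directly. A minimal sketch, assuming Spark 3.0+ for vector_to_array (on older versions the small UDF at the end does the same extraction):

from pyspark.ml.functions import vector_to_array  # Spark 3.0+

# Re-create the 'features' vector on the test data with the same assembler
test_data = assembler.transform(df_to_predict)

# transform adds 'rawPrediction', 'probability' and 'prediction' columns
predictions = model.transform(test_data)

# 'probability' is a vector; element 1 is P(Y = 1) -- the equivalent
# of sklearn's predict_proba(...)[:, 1]
result = predictions.withColumn(
    "Predicted_Y_Probability", vector_to_array("probability")[1]
)

On Spark < 3.0 the UDF route from the linked answer looks like:

from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

# Pull the second element out of the probability vector
get_p1 = udf(lambda v: float(v[1]), DoubleType())
result = predictions.withColumn("Predicted_Y_Probability", get_p1("probability"))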
Upvotes: -1