Reputation: 315
I am very new to PySpark and need to perform prediction. I've already done everything in Python, but the data I have to apply the logic to is huge, so I need to translate everything to PySpark.
The problem is: I have 2 dataframes. The first one is for training and contains the Y column; the second one is for testing (loaded from my S3 bucket).
I already cleaned and prepared the training dataframe in Python and converted it to a PySpark df:
from pyspark.sql import SparkSession
import pandas as pd

# Create a SparkSession
spark = SparkSession.builder.appName('myapp').getOrCreate()

# Load the pandas DataFrame
pdf = dataframe  # dataframe is the name of my pandas df

# Convert the pandas DataFrame to a PySpark DataFrame
df_spark = spark.createDataFrame(pdf)
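(Side note: for a large pandas frame, enabling Arrow can speed up createDataFrame considerably. A minimal sketch, assuming Spark 3.x, where the config key is spark.sql.execution.arrow.pyspark.enabled:)

# Optional: Arrow-accelerated pandas -> Spark conversion (Spark 3.x config key)
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")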
I selected the training columns:
y = df_spark.select("Y")
X = df_spark.select([c for c in df_spark.columns if c != "Y"])
training_columns = X.columns
I followed a tutorial that used VectorAssembler:
from pyspark.ml.feature import VectorAssembler

# Combine all training columns into a single 'features' vector column
assembler = VectorAssembler(inputCols=training_columns, outputCol='features')
output = assembler.transform(df_spark)
final_data = output.select('features', 'Y')
I trained the model (Random Forest):
from pyspark.ml.classification import RandomForestClassifier

# Set the hyperparameters
n_estimators = 100
max_depth = 5

# Create the model
rfc = RandomForestClassifier(
    featuresCol='features',
    labelCol='Y',
    numTrees=n_estimators,
    maxDepth=max_depth
)

# Fit the model on the training data
model = rfc.fit(final_data)
I checked the model evaluation on the training data:
predictions = model.transform(final_data)

from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Default metric is areaUnderROC
my_binary_eval = BinaryClassificationEvaluator(labelCol='Y')
print(my_binary_eval.evaluate(predictions))
Now comes the moment where I want to apply the model to a different PySpark dataframe: the test data, with a lot of records, called df_to_predict.
df_to_predict might have a slightly different set of columns compared to the training data, since I dropped columns with no variance during training. Hence I first need to restrict it to the columns from the training set:
df_to_predict = df_to_predict.select(training_columns)
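(If a training column could ever be missing from df_to_predict, this select would fail, so a quick check first is a cheap safeguard; a minimal sketch:)

# Fail early with a clear message if the test data lacks any training column
missing = set(training_columns) - set(df_to_predict.columns)
assert not missing, f"df_to_predict is missing training columns: {missing}"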
Next I would like to apply the model and do the equivalent of predict_proba (that one is from the sklearn package); I do not know how to convert this code to PySpark:
df_to_predict["Predicted_Y_Probability"] = model.predict_proba(df_to_predict)[::, 1]
Upvotes: 0
Views: 1018
Reputation: 1256
You can use a UDF to wrap your custom function. Check this answer:
Unable to make prediction with Sklearn model on pyspark dataframe
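In your case, though, the trained model is a Spark ML RandomForestClassifier rather than an sklearn model, so a UDF isn't strictly required: model.transform already adds a probability vector column, and you can extract the positive-class probability from it directly. A minimal sketch, assuming Spark 3.0+ for vector_to_array (on older versions the small UDF at the end does the same extraction):

from pyspark.ml.functions import vector_to_array  # Spark 3.0+

# Re-create the 'features' vector on the test data with the same assembler
test_data = assembler.transform(df_to_predict)

# transform adds 'rawPrediction', 'probability' and 'prediction' columns
predictions = model.transform(test_data)

# 'probability' is a vector; element 1 is P(Y = 1) -- the equivalent
# of sklearn's predict_proba(...)[:, 1]
result = predictions.withColumn(
    "Predicted_Y_Probability", vector_to_array("probability")[1]
)

On Spark < 3.0 the UDF route from the linked answer looks like:

from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

# Pull the second element out of the probability vector
get_p1 = udf(lambda v: float(v[1]), DoubleType())
result = predictions.withColumn("Predicted_Y_Probability", get_p1("probability"))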
Upvotes: -1