Kas

Reputation: 315

predict_proba on pyspark testing dataframe

I am very new to PySpark and need to perform prediction. I've already done everything in Python, but because the data I have to apply the logic to is huge, I need to transform everything to PySpark.

The problem is: I have 2 dataframes. The first dataframe is for training purposes and has the Y column; the second one is for testing (from my S3 bucket).

  1. I already cleaned and prepared training dataframe in python and converted it to pyspark df:

    from pyspark.sql import SparkSession
    import pandas as pd

    # Create a SparkSession
    spark = SparkSession.builder.appName('myapp').getOrCreate()

    # Load the Pandas DataFrame
    pdf = dataframe  # dataframe is the name of my pandas df

    # Convert the Pandas DataFrame to a PySpark DataFrame
    df_spark = spark.createDataFrame(pdf)

  2. I selected training columns - for training purposes:

    y = df_spark.select("Y")

    X = df_spark.select([c for c in df_spark.columns if c != "Y"])

    training_columns = X.columns

  3. I followed the tutorial where I used VectorAssembler:

    from pyspark.ml.feature import VectorAssembler

    assembler = VectorAssembler(inputCols=training_columns, outputCol='features')

    output = assembler.transform(df_spark)

    final_data = output.select('features', 'Y')

  4. I trained the model - Random Forest

    from pyspark.ml.classification import RandomForestClassifier
    from pyspark.ml import Pipeline

    # Set the hyperparameters
    n_estimators = 100
    max_depth = 5

    # Create the model
    rfc = RandomForestClassifier(featuresCol='features', labelCol='Y',
                                 numTrees=n_estimators, maxDepth=max_depth)

    # Fit the model on the training data
    model = rfc.fit(final_data)

  5. I checked model evaluation on training data:

    predictions = model.transform(final_data)

    from pyspark.ml.evaluation import BinaryClassificationEvaluator

    my_binary_eval = BinaryClassificationEvaluator(labelCol='Y')

    print(my_binary_eval.evaluate(predictions))

  6. Now comes the moment where I want to apply the model to a different PySpark dataframe - the test data, with a lot of records = df_to_predict

df_to_predict might have a slightly different set of columns compared to the training data, as I was dropping columns with no variance. Hence I first need to select the columns from the training set:

df_to_predict = df_to_predict.select(training_columns)

Next I would like to apply the model and do predict_proba - this is from the sklearn package, and I do not know how to apply/convert this code to PySpark:

df_to_predict["Predicted_Y_Probability"] = model.predict_proba(df_to_predict)[::, 1]

Upvotes: 0

Views: 1018

Answers (1)

Amir Hossein Shahdaei

Reputation: 1256

You can use a UDF to make your custom functions. Check this answer:

Unable to make prediction with Sklearn model on pyspark dataframe

Upvotes: -1
