Willard

Reputation: 522

How to compute bias and variance of pyspark linear regression model

I am trying to calculate the bias and variance of a PySpark linear regression model. I start with a 3rd-degree polynomial, add some noise, and fit a linear regression model with varying degrees of polynomial expansion. The goal is to show that bias decreases and variance increases as the degree of the polynomial expansion increases. In my code below, the model bias remains constant because the mean of the prediction is the same for polynomial degrees 1, 2, and 3. I must be calculating bias incorrectly, and I'm also wondering whether I'm calculating variance correctly. Can someone verify whether I'm calculating bias correctly and help me figure out why the bias remains the same regardless of the polynomial expansion degree? Comments on anything else that is incorrect in the code are welcome.

    from pyspark.sql import SparkSession
    from pyspark.ml import feature, regression, Pipeline
    from pyspark.sql import functions as fn
    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd
    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext

    # create numpy arrays for x and y data
    x = np.linspace(-15, 15, 250)
    y = 10 + 5*x + 0.5*np.square(x) - 0.1*np.power(x,3)
    reducible_error = np.random.uniform(-50, 50, len(x))
    irreducible_error = np.random.normal(0, 8, len(x))
    y_noise = y + reducible_error + irreducible_error

    # plot x and y data
    # %matplotlib inline  # IPython magic: uncomment when running in a Jupyter notebook
    plt.figure()
    plt.plot(x,y, c='r', label="y")
    plt.scatter(x, y_noise, label="y_noise")
    plt.legend()
    plt.title("10 + 5x + 0.5x^2 - 0.1x^3")
    plt.xlabel("x")
    plt.ylabel("y, y_noise")

    # create a pandas dataframe from the x, y_noise, and y data arrays
    pd_df = pd.DataFrame({'x': x, 'y_noise': y_noise, 'y': y}, columns=['x', 'y_noise', 'y'])

    # create a spark dataframe from the pandas dataframe
    df = spark.createDataFrame(pd_df)
    df.show()

    def get_bias_squared(df):
        f_hat_mean = np.mean(df['prediction'])
        return np.mean(np.square(df['y_noise'] - f_hat_mean))

    def get_variance(df):
        f_hat_mean = np.mean(df['prediction'])
        diff = df['prediction'] - f_hat_mean
        return np.mean(np.square(diff))    

    def plot_poly_expansion(n, df, lambda_reg=0., alpha_reg=0.):
        # create the pipeline stages and fit the pipeline
        va = feature.VectorAssembler(inputCols=['x'], outputCol='features')
        pe = feature.PolynomialExpansion(degree=n, inputCol='features', outputCol='poly_features')
        lr = regression.LinearRegression(featuresCol='poly_features', labelCol='y_noise', regParam=lambda_reg, 
                                         elasticNetParam=alpha_reg)
        pipe = Pipeline(stages=[va, pe, lr]).fit(df)

        # apply the fitted pipeline to add a prediction column
        fit_df = pipe.transform(df)

        # convert the fitted spark dataframe to pandas and plot predicted vs. actual
        fit_pd_df = fit_df.toPandas()

        # display(fit_pd_df.head())
        fit_pd_df.plot(x='x', y=['y', 'y_noise', 'prediction'])
        plt.title("Polynomial degree = %s\nBias = %s, Variance = %s" % (n, get_bias_squared(fit_pd_df),
                                                                        get_variance(fit_pd_df)))
        plt.xlabel("x")
        plt.ylabel("y")

        return fit_pd_df

    for i in np.arange(1, 4):
        plot_poly_expansion(int(i), df)

Upvotes: 2

Views: 659

Answers (1)

Robert Long

Reputation: 6802

Your calculations look OK. I would just make a small modification so that bias is measured against the true function rather than the noisy labels:

    def get_bias_squared(df, true_function):
        # The true function should be a callable that gives the true y value for each x
        f_hat_mean = np.mean(df['prediction'])
        true_values = true_function(df['x'])
        return np.mean(np.square(f_hat_mean - true_values))

    def get_variance(df):
        f_hat_mean = np.mean(df['prediction'])
        diff = df['prediction'] - f_hat_mean
        return np.mean(np.square(diff))

    def true_function(x):
        # This is your true function without noise
        return 10 + 5*x + 0.5*np.square(x) - 0.1*np.power(x,3)

Polynomial expansion typically reduces bias and increases variance as the degree of the polynomial increases, because the model becomes more flexible and begins to fit the noise in the training data (i.e., overfitting).
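On why the numbers in the question do not move: a least-squares fit with an intercept has residuals that sum to zero on the training data, so the mean of `prediction` equals the mean of `y_noise` for every polynomial degree. As far as I can tell, any bias formula built on that single scalar mean (including `f_hat_mean` as computed above) will therefore stay constant across degrees. Here is a minimal NumPy sketch of this. It assumes `np.polyfit` as a stand-in for the Spark pipeline, so it is not the PySpark code from the question, and the seed is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility

# Same true cubic and noise recipe as in the question
x = np.linspace(-15, 15, 250)
y_true = 10 + 5*x + 0.5*np.square(x) - 0.1*np.power(x, 3)
y_noise = y_true + rng.uniform(-50, 50, len(x)) + rng.normal(0, 8, len(x))

mean_predictions = {}
for degree in (1, 2, 3):
    # Assumption: np.polyfit (least squares with intercept) replaces the Spark pipeline
    coeffs = np.polyfit(x, y_noise, degree)
    prediction = np.polyval(coeffs, x)
    mean_predictions[degree] = prediction.mean()
    # The two printed means agree for every degree
    print(degree, prediction.mean(), y_noise.mean())
```

The means agree because the intercept absorbs any constant offset in the residuals, regardless of how many polynomial terms are added.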

Your code computes bias and variance from a single model realisation. To estimate them properly, fit the model many times on different subsets of the data (or on different bootstrapped samples of the data) and then, for each data point, compare the predictions across the different model realisations: the squared gap between their average and the true value estimates the squared bias, and their spread around that average estimates the variance.
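The resampling idea can be sketched roughly as follows. This is an illustration under assumptions, not the asker's pipeline: it uses `np.polyfit` instead of the PySpark `LinearRegression`, and `bootstrap_bias_variance`, `n_boot`, and the seeds are names and values made up for the example.

```python
import numpy as np

def true_function(x):
    # Noise-free cubic from the question
    return 10 + 5*x + 0.5*np.square(x) - 0.1*np.power(x, 3)

def bootstrap_bias_variance(x, y_noise, degree, n_boot=200, seed=0):
    """Estimate average squared bias and variance over bootstrap refits.

    Hypothetical helper: refits a polynomial on resampled rows and
    compares the predictions at each x across the refits.
    """
    rng = np.random.default_rng(seed)
    preds = np.empty((n_boot, len(x)))
    for b in range(n_boot):
        idx = rng.integers(0, len(x), len(x))    # resample rows with replacement
        coeffs = np.polyfit(x[idx], y_noise[idx], degree)
        preds[b] = np.polyval(coeffs, x)         # predict on the full, fixed grid
    mean_pred = preds.mean(axis=0)               # average prediction at each x
    bias_sq = np.mean(np.square(mean_pred - true_function(x)))
    variance = np.mean(preds.var(axis=0))        # spread across refits, averaged over x
    return bias_sq, variance

rng = np.random.default_rng(1)
x = np.linspace(-15, 15, 250)
y_noise = true_function(x) + rng.uniform(-50, 50, len(x)) + rng.normal(0, 8, len(x))

results = {d: bootstrap_bias_variance(x, y_noise, d) for d in (1, 2, 3)}
for d, (b, v) in results.items():
    print(f"degree={d}: bias^2={b:.1f}, variance={v:.1f}")
```

Because the true function is cubic, I would expect the squared bias to drop sharply once the degree reaches 3. One caveat: when (x, y) pairs are resampled for an underfit model, the variance estimate also absorbs some of the misfit, so it need not increase monotonically with degree on this particular dataset.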

In practice, though, and particularly with big data tools like PySpark, this can be computationally expensive because it requires fitting the model many times. For large datasets and models, bias and variance are frequently assessed indirectly using other methods, such as cross-validation.
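As a rough sketch of the cross-validation route: held-out error is high both when the model underfits (bias dominates) and when it overfits (variance dominates), so comparing it across degrees gives an indirect view of the trade-off. The example below uses plain NumPy with `np.polyfit` as a stand-in model, and `kfold_mse` is a hypothetical helper written for this illustration. In PySpark itself, `pyspark.ml.tuning.CrossValidator` can wrap a pipeline for the same purpose.

```python
import numpy as np

def kfold_mse(x, y, degree, k=5, seed=0):
    """Mean held-out squared error of a polynomial fit, averaged over k folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), k)  # shuffled index folds
    errors = []
    for test_idx in folds:
        train_mask = np.ones(len(x), dtype=bool)
        train_mask[test_idx] = False                    # hold this fold out
        coeffs = np.polyfit(x[train_mask], y[train_mask], degree)
        residuals = y[test_idx] - np.polyval(coeffs, x[test_idx])
        errors.append(np.mean(np.square(residuals)))
    return float(np.mean(errors))

rng = np.random.default_rng(2)  # arbitrary seed
x = np.linspace(-15, 15, 250)
y_noise = (10 + 5*x + 0.5*np.square(x) - 0.1*np.power(x, 3)
           + rng.uniform(-50, 50, len(x)) + rng.normal(0, 8, len(x)))

cv_scores = {d: kfold_mse(x, y_noise, d) for d in range(1, 7)}
for d, score in cv_scores.items():
    print(f"degree={d}: held-out MSE={score:.1f}")
```

This costs only `k` fits per degree rather than hundreds of bootstrap refits, which is why it tends to be the more practical choice on large data.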

Upvotes: 0
