GluonCollision

Reputation: 1

No Model Summary For GLMs in Pyspark / SparkML

I'm familiarizing myself with Pyspark and SparkML at the moment. To do so I use the titanic dataset to train a GLM for predicting the 'Fare' in that dataset.

I'm closely following the Spark documentation. I do get a working model (which I call glm_fare), but when I try to assess the fitted model using summary, I get the following error message:

RuntimeError: No training summary available for this GeneralizedLinearRegressionModel

Why is this?

The code for training was as such:

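from pyspark.ml.regression import GeneralizedLinearRegression

# Gamma GLM with a log link, weighted by the 'wght' column
glm_fare = GeneralizedLinearRegression(
    labelCol="Fare",
    featuresCol="features",
    predictionCol="prediction",
    family="gamma",
    link="log",
    weightCol="wght",
    maxIter=20,
)
glm_fit = glm_fare.fit(training_df)

# Accessing the summary is what raises the RuntimeError
glm_fit.summary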

Upvotes: 0

Views: 1768

Answers (3)

Wei Xu

Reputation: 16

Make sure the input values for the one-hot encoder start from 0. One mistake that prevented the summary from being created for me: I passed quarter (1, 2, 3, 4) directly to the one-hot encoder and got a vector of length 4 in which one column was always 0. After I converted quarter to 0, 1, 2, 3, the problem was solved.
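A minimal NumPy sketch of why the all-zero column matters. It assumes the encoder sizes its output from the largest index and drops the last category (the default dropLast=True behaviour of Spark's OneHotEncoder). The helpers one_hot_drop_last and design_rank are hypothetical stand-ins written for illustration, not Spark code.

```python
import numpy as np

def one_hot_drop_last(indices, num_categories):
    """One-hot encode integer indices, mapping the last category to all zeros."""
    out = np.zeros((len(indices), num_categories - 1))
    for row, idx in enumerate(indices):
        if idx < num_categories - 1:
            out[row, idx] = 1.0
    return out

def design_rank(features):
    """Return (rank, number of columns) of the design matrix with an intercept."""
    X = np.column_stack([np.ones(len(features)), features])
    return np.linalg.matrix_rank(X), X.shape[1]

quarters = np.array([1, 2, 3, 4, 1, 2, 3, 4])

# Raw values 1..4: indices 0..4 imply 5 categories, and index 0 never occurs,
# so the first encoded column is all zeros.
raw = one_hot_drop_last(quarters, num_categories=5)

# Shifted values 0..3: 4 categories, every encoded column is used.
shifted = one_hot_drop_last(quarters - 1, num_categories=4)

raw_rank, raw_cols = design_rank(raw)
shifted_rank, shifted_cols = design_rank(shifted)
print("raw:     rank", raw_rank, "of", raw_cols, "columns")
print("shifted: rank", shifted_rank, "of", shifted_cols, "columns")
```

With the raw values the design matrix is rank deficient (fewer independent columns than columns), while the shifted values give a full-rank matrix.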

Upvotes: 0

hello123

Reputation: 51

Just in case someone comes across this question, I ran into this problem as well and it seems that this error occurs when the Hessian matrix is not invertible. This matrix is used in the maximization of the likelihood for estimating the coefficients.

The matrix is not invertible if one of its eigenvalues is 0, which occurs when there is perfect multicollinearity among your variables. This means that one of the variables can be written exactly as a linear combination of the other variables. Consequently, the individual effects of those variables cannot be identified.

A possible solution would be to find the variables that are (multi)collinear and remove one of them from the regression. Note however that multicollinearity is only a problem if you want to interpret the coefficients and not when the model is used for prediction.
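One way to look for such variables outside Spark is sketched below with NumPy, on a small made-up matrix. The function dependent_columns is a hypothetical helper, and X.T @ X is used only as a simplified stand-in for the Hessian; the matrix Spark actually builds for a weighted GLM fit will differ.

```python
import numpy as np

def dependent_columns(X):
    """Return indices of columns that are linear combinations of earlier columns."""
    kept, dropped = [], []
    for j in range(X.shape[1]):
        candidate = X[:, kept + [j]]
        # Keep the column only if it raises the rank of the columns kept so far.
        if np.linalg.matrix_rank(candidate) > len(kept):
            kept.append(j)
        else:
            dropped.append(j)
    return dropped

rng = np.random.default_rng(0)
a = rng.normal(size=50)
b = rng.normal(size=50)
X = np.column_stack([a, b, 2 * a - b])  # third column is exactly collinear

# Eigenvalues in ascending order; the smallest is numerically zero here.
eigs = np.linalg.eigvalsh(X.T @ X)

redundant = dependent_columns(X)
X_reduced = np.delete(X, redundant, axis=1)
print("redundant columns:", redundant)
print("smallest / largest eigenvalue:", abs(eigs[0]) / eigs[-1])
```

The greedy scan keeps the first columns it sees, so which member of a collinear group gets flagged depends on column order; choosing which one to drop is a modelling decision.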

Upvotes: 4

pissall

Reputation: 7409

The GeneralizedLinearRegressionModel docs note that a model may have no summary available.

However, you can check first to avoid the error. In PySpark, hasSummary is a public boolean property, so it is read without parentheses:

if glm_fit.hasSummary:
    print(glm_fit.summary)

Here is a direct link to the PySpark source code, to the GeneralizedLinearRegressionTrainingSummary class source code, and to where the error is thrown.

Upvotes: 0
