Reputation: 41
I built a logistic regression model using a pipeline similar to the one in the Databricks documentation: https://docs.databricks.com/spark/latest/mllib/binary-classification-mllib-pipelines.html
The features (numeric and string) were encoded using OneHotEncoderEstimator and then transformed with StandardScaler.
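For reference, the encoding part of my pipeline looks roughly like this (the column names here are placeholders, not my actual ones):
from pyspark.ml import Pipeline
from pyspark.ml.feature import (OneHotEncoderEstimator, StandardScaler,
                                StringIndexer, VectorAssembler)

# Index and one-hot encode the string column, assemble everything
# into a single vector, then scale (column names are illustrative)
indexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
encoder = OneHotEncoderEstimator(inputCols=["categoryIndex"],
                                 outputCols=["categoryVec"])
assembler = VectorAssembler(inputCols=["categoryVec", "num1", "num2"],
                            outputCol="features")
scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures")
pipeline = Pipeline(stages=[indexer, encoder, assembler, scaler])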
I would like to know how to map the weights (coefficients) obtained from the logistic regression back to the feature names in the original DataFrame. In other words, how do I get the feature corresponding to each coefficient produced by the model?
Thank you
I have tried to extract the features from lrModel.schema, which gave a list of StructFields showing the features, and then to map those to the weights, but I was not successful:
from pyspark.ml.classification import LogisticRegression
# Create initial LogisticRegression model
lr = LogisticRegression(labelCol="label", featuresCol="scaledFeatures", maxIter=10)
# Train model with Training Data
lrModel = lr.fit(trainingData)
predictions = lrModel.transform(trainingData)
LRschema = predictions.schema
The expected outcome of the extraction is a list of tuples: (feature weight, feature name).
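For example, something like this (values and names made up):
expected = [(0.1583, "age"), (0.0239, "city_NY")]  # (weight, feature name) pairs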
Upvotes: 4
Views: 4753
Reputation: 209
None of the above solutions worked for my case. My model has a mix of numeric and binary variables. Also, all of the data transformations and the model validation are connected in one long pipeline, so the only place I could see the schema was in the predictions DataFrame. I was able to hack together some code that iterates through the schema, builds a dictionary of all of the variable names, and then connects that to the coefficients.
# Extract the coefficients on each of the variables
coeff = mymodel.coefficients.toArray().tolist()

# Loop through the features to extract the original column names;
# store them in the var_index dictionary keyed by vector index
var_index = dict()
for variable_type in ['numeric', 'binary']:
    for variable in predictions.schema["features"].metadata["ml_attr"]["attrs"][variable_type]:
        print("Found variable:", variable)
        idx = variable['idx']
        name = variable['name']
        var_index[idx] = name  # Add the name to the dictionary

# Loop through all of the variables found and print the associated coefficients
for i in range(len(var_index)):
    print(i, var_index[i], coeff[i])
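If you want the (weight, name) tuples from the original question, a small extension of the above does it (assuming var_index covers every index):
# Pair each coefficient with its recovered column name
weight_name_pairs = [(coeff[i], var_index[i]) for i in range(len(var_index))]
print(weight_name_pairs)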
Upvotes: 1
Reputation: 631
Let's say you have a logistic regression to work with; this Pandas workaround will give you the result.
import pandas as pd

lr = LogisticRegression(labelCol="label", featuresCol="features", maxIter=50, threshold=0.5)
lr_model = lr.fit(train_set)
print("Intercept: " + str(lr_model.intercept))

# Pull the (idx, name) metadata for the numeric attributes, sort by
# position in the features vector, and pair with the coefficients
numeric_attrs = pd.DataFrame(train_set.schema["features"].metadata["ml_attr"]["attrs"]["numeric"])
pd.DataFrame({
    "coefficients": lr_model.coefficients.toArray(),
    "feature": list(numeric_attrs.sort_values("idx")["name"]),
})
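Note that this only reads the 'numeric' attribute group. If your features vector also contains one-hot encoded columns, the metadata will have a 'binary' group as well (as in the answer above); a sketch that merges all groups, assuming the same metadata layout:
import pandas as pd

# Merge all attribute groups (e.g. 'numeric' and 'binary'),
# sort by position in the features vector, and pair with coefficients
attrs = train_set.schema["features"].metadata["ml_attr"]["attrs"]
all_attrs = sorted((a for group in attrs.values() for a in group),
                   key=lambda a: a["idx"])
coef_df = pd.DataFrame({
    "feature": [a["name"] for a in all_attrs],
    "coefficient": lr_model.coefficients.toArray(),
})
print(coef_df)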
Upvotes: 0
Reputation: 2500
It is not a direct output of LogisticRegression, but it can be obtained with the following function I use:
def ExtractFeatureCoefficient(model, dataset, excludedCols=None):
    test = model.transform(dataset)
    weights = model.coefficients
    print('This is model weights: \n', weights)
    weights = [(float(w),) for w in weights]  # convert numpy type to float, and to tuple
    if excludedCols is None:
        feature_col = [f for f in test.schema.names
                       if f not in ['y', 'classWeights', 'features', 'label',
                                    'rawPrediction', 'probability', 'prediction']]
    else:
        feature_col = [f for f in test.schema.names if f not in excludedCols]
    if len(weights) == len(feature_col):
        weightsDF = sqlContext.createDataFrame(list(zip(weights, feature_col)),
                                               schema=["Coefficients", "FeatureName"])
    else:
        weightsDF = None
        print('Coefficients do not match the remaining features in the model; '
              'please check the field list with model.transform(dataset).schema.names')
    return weightsDF
results = ExtractFeatureCoefficient(lr_model, trainingData)
results.show()
This will generate a Spark DataFrame with the following fields:
+--------------------+--------------------+
|        Coefficients|         FeatureName|
+--------------------+--------------------+
|[0.15834847825223...|                name|
|               [0.0]|                 lat|
+--------------------+--------------------+
Alternatively, you can fit a GLM model as follows:
from pyspark.ml.regression import GeneralizedLinearRegression

glm = GeneralizedLinearRegression(family="binomial", link="logit",
                                  featuresCol="features", labelCol="label",
                                  maxIter=1000, regParam=0.8,
                                  weightCol="classWeights")
# Train the model. This also runs the indexer.
model = glm.fit(trainingData)
# Then get the summary of the model:
summary = model.summary
print(summary)
This generates output like:
Coefficients:
Feature Estimate Std Error T Value P Value
(Intercept) -1.3079 0.0705 -18.5549 0.0000
name 0.1248 0.0158 7.9129 0.0000
lat 0.0239 0.0209 1.1455 0.2520
Upvotes: 1