Reputation: 41
I built a logistic regression model using a pipeline similar to the one in the Databricks documentation: https://docs.databricks.com/spark/latest/mllib/binary-classification-mllib-pipelines.html
The features (numeric and string) were encoded using OneHotEncoderEstimator and then transformed with StandardScaler.
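For reference, the encoding part of my pipeline looks roughly like this (the column names here are placeholders, not my actual ones):
from pyspark.ml import Pipeline
from pyspark.ml.feature import (OneHotEncoderEstimator, StandardScaler,
                                StringIndexer, VectorAssembler)

# Index and one-hot encode the string column, assemble everything
# into a single vector, then scale (column names are illustrative)
indexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
encoder = OneHotEncoderEstimator(inputCols=["categoryIndex"],
                                 outputCols=["categoryVec"])
assembler = VectorAssembler(inputCols=["categoryVec", "num1", "num2"],
                            outputCol="features")
scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures")
pipeline = Pipeline(stages=[indexer, encoder, assembler, scaler])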
I would like to know how to map the weights (coefficients) obtained from the logistic regression back to the feature names in the original DataFrame. In other words, how do I get the feature corresponding to each coefficient produced by the model?
Thank you
I have tried to extract the features from lrModel.schema, which gave a list of StructFields showing the features, and then to map those to the weights, but I was not successful:
from pyspark.ml.classification import LogisticRegression
# Create initial LogisticRegression model
lr = LogisticRegression(labelCol="label", featuresCol="scaledFeatures", maxIter=10)
# Train model with Training Data
lrModel = lr.fit(trainingData)
predictions = lrModel.transform(trainingData)
LRschema = predictions.schema
The expected outcome of the extraction is a list of tuples: (feature weight, feature name).
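For example, something like this (values and names made up):
expected = [(0.1583, "age"), (0.0239, "city_NY")]  # (weight, feature name) pairs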
Upvotes: 4
Views: 4753
Reputation: 209
None of the above solutions worked for my case. My model has a mix of numeric and binary variables. Also, all of the data transformations and the model validation are connected in one long pipeline, so the only place I could see the schema was in the predictions DataFrame. I was able to hack together some code that iterates through the schema, builds a dictionary of all of the variable names, and then connects that to the coefficients.
# Extract the coefficients on each of the variables
coeff = mymodel.coefficients.toArray().tolist()

# Loop through the features to extract the original column names;
# store them in the var_index dictionary keyed by vector index
var_index = dict()
for variable_type in ['numeric', 'binary']:
    for variable in predictions.schema["features"].metadata["ml_attr"]["attrs"][variable_type]:
        print("Found variable:", variable)
        idx = variable['idx']
        name = variable['name']
        var_index[idx] = name  # Add the name to the dictionary

# Loop through all of the variables found and print the associated coefficients
for i in range(len(var_index)):
    print(i, var_index[i], coeff[i])
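If you want the (weight, name) tuples from the original question, a small extension of the above does it (assuming var_index covers every index):
# Pair each coefficient with its recovered column name
weight_name_pairs = [(coeff[i], var_index[i]) for i in range(len(var_index))]
print(weight_name_pairs)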
Upvotes: 1
Reputation: 631
Let's say you have a logistic regression to work with; this Pandas workaround will give you the result.
import pandas as pd

lr = LogisticRegression(labelCol="label", featuresCol="features", maxIter=50, threshold=0.5)
lr_model = lr.fit(train_set)
print("Intercept: " + str(lr_model.intercept))

# Pull the (idx, name) metadata for the numeric attributes, sort by
# position in the features vector, and pair with the coefficients
numeric_attrs = pd.DataFrame(train_set.schema["features"].metadata["ml_attr"]["attrs"]["numeric"])
pd.DataFrame({
    "coefficients": lr_model.coefficients.toArray(),
    "feature": list(numeric_attrs.sort_values("idx")["name"]),
})
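Note that this only reads the 'numeric' attribute group. If your features vector also contains one-hot encoded columns, the metadata will have a 'binary' group as well (as in the answer above); a sketch that merges all groups, assuming the same metadata layout:
import pandas as pd

# Merge all attribute groups (e.g. 'numeric' and 'binary'),
# sort by position in the features vector, and pair with coefficients
attrs = train_set.schema["features"].metadata["ml_attr"]["attrs"]
all_attrs = sorted((a for group in attrs.values() for a in group),
                   key=lambda a: a["idx"])
coef_df = pd.DataFrame({
    "feature": [a["name"] for a in all_attrs],
    "coefficient": lr_model.coefficients.toArray(),
})
print(coef_df)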
Upvotes: 0
Reputation: 2500
It is not a direct output of LogisticRegression, but it can be obtained with the following function I use:
def ExtractFeatureCoefficient(model, dataset, excludedCols=None):
    test = model.transform(dataset)
    weights = model.coefficients
    print('This is model weights: \n', weights)
    weights = [(float(w),) for w in weights]  # convert numpy type to float, and to tuple
    if excludedCols is None:
        feature_col = [f for f in test.schema.names
                       if f not in ['y', 'classWeights', 'features', 'label',
                                    'rawPrediction', 'probability', 'prediction']]
    else:
        feature_col = [f for f in test.schema.names if f not in excludedCols]
    if len(weights) == len(feature_col):
        weightsDF = sqlContext.createDataFrame(list(zip(weights, feature_col)),
                                               schema=["Coefficients", "FeatureName"])
    else:
        weightsDF = None
        print('Coefficients do not match the remaining features in the model; '
              'please check the field list with model.transform(dataset).schema.names')
    return weightsDF
results = ExtractFeatureCoefficient(lr_model, trainingData)
results.show()
This will generate a Spark DataFrame with the following fields:
+--------------------+--------------------+
|        Coefficients|         FeatureName|
+--------------------+--------------------+
|[0.15834847825223...|                name|
|               [0.0]|                 lat|
+--------------------+--------------------+
Alternatively, you can fit a GLM model as follows:
from pyspark.ml.regression import GeneralizedLinearRegression

glm = GeneralizedLinearRegression(family="binomial", link="logit",
                                  featuresCol="features", labelCol="label",
                                  maxIter=1000, regParam=0.8,
                                  weightCol="classWeights")
# Train the model. This also runs the indexer.
model = glm.fit(trainingData)
# Then get the summary of the model:
summary = model.summary
print(summary)
This generates output like:
Coefficients:
Feature Estimate Std Error T Value P Value
(Intercept) -1.3079 0.0705 -18.5549 0.0000
name 0.1248 0.0158 7.9129 0.0000
lat 0.0239 0.0209 1.1455 0.2520
Upvotes: 1