Reputation: 145
I have a dataframe that looks like this (it is obviously much bigger):
id    points  isAvailable  frequency  Score
abc1  325     0            93         0.01
def2  467     1            80         0.59
ghi3  122     1            90         1
jkl4  546     1            84         0
mno5  355     0            93         0.99
I want to see how much the features points, isAvailable and frequency influence the Score. I want to use Random Forests like in this example:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
#from sklearn.inspection import permutation_importance
#import shap
from matplotlib import pyplot as plt
plt.rcParams.update({'figure.figsize': (12.0, 8.0)})
plt.rcParams.update({'font.size': 14})
list_of_columns = ['points','isAvailable', 'frequency']
X = df[list_of_columns]
target_column = 'Score'
y = df[target_column]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=12)
rf = RandomForestRegressor(n_estimators=100)
rf.fit(X_train, y_train)
rf.feature_importances_ #the array below is the output
>>> array([0.44326132, 0.01666047, 0. , 0.5400782 ])
plt.barh(df.columns, rf.feature_importances_)
On the last line I get the following error: ValueError: shape mismatch: objects cannot be broadcast to a single shape. Should I have created those columns in the beginning? Is there a problem in the (bigger) data?
Upvotes: 1
Views: 2745
Reputation: 41477
The rf model is trained on X, which is only a subset of df, so the feature importances should be plotted against X.columns (or list_of_columns) instead of df.columns:
plt.barh(X.columns, rf.feature_importances_)
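To see the mismatch concretely, here is a minimal sketch on a synthetic stand-in for the dataframe (the values and the 200-row size are made up, and random_state is fixed only for reproducibility). It labels the importances with X.columns by wrapping them in a pandas Series, so each value stays attached to its feature name:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the asker's dataframe (all values are made up).
rng = np.random.default_rng(12)
n_rows = 200
df = pd.DataFrame({
    "id": [f"row{i}" for i in range(n_rows)],
    "points": rng.integers(100, 600, size=n_rows),
    "isAvailable": rng.integers(0, 2, size=n_rows),
    "frequency": rng.integers(80, 95, size=n_rows),
    "Score": rng.random(n_rows),
})

list_of_columns = ["points", "isAvailable", "frequency"]
X = df[list_of_columns]
y = df["Score"]

rf = RandomForestRegressor(n_estimators=100, random_state=12)
rf.fit(X, y)

# df has 5 columns (including 'id' and 'Score'), but the model only
# saw 3 features, so there are only 3 importances to plot.
print(len(df.columns), len(rf.feature_importances_))

# Attach the feature names so values and labels cannot drift apart.
importances = pd.Series(rf.feature_importances_, index=X.columns)
print(importances.sort_values())
```

With the names attached, plt.barh(importances.index, importances.values) receives two sequences of the same length, which is exactly what the original call with df.columns did not.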
Upvotes: 1