n.mathfreak

Reputation: 145

Matplotlib: shape mismatch: objects cannot be broadcast to a single shape

I have a dataframe that looks like this (it is obviously much bigger):

id     points isAvailable frequency   Score
abc1   325    0           93          0.01
def2   467    1           80          0.59
ghi3   122    1           90          1 
jkl4   546    1           84          0
mno5   355    0           93          0.99

I want to see how much the features points, isAvailable and frequency influence the Score. I want to use Random Forests like in this example:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
#from sklearn.inspection import permutation_importance
#import shap
from matplotlib import pyplot as plt

plt.rcParams.update({'figure.figsize': (12.0, 8.0)})
plt.rcParams.update({'font.size': 14})

list_of_columns = ['points','isAvailable', 'frequency']
X = df[list_of_columns]
target_column = 'Score'
y = df[target_column]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=12)

rf = RandomForestRegressor(n_estimators=100)
rf.fit(X_train, y_train)
rf.feature_importances_  # output:
# array([0.44326132, 0.01666047, 0.        , 0.5400782 ])

plt.barh(df.columns, rf.feature_importances_)

On the last line I get the following error: ValueError: shape mismatch: objects cannot be broadcast to a single shape. Should I have created those columns in the beginning? Is there a problem in the (bigger) data?

Upvotes: 1

Views: 2745

Answers (1)

tdy

Reputation: 41477

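The rf model is trained on X_train, which contains only the columns in list_of_columns, a subset of df. rf.feature_importances_ therefore has one value per column of X, while df.columns also includes id and Score, so the two sequences have different lengths and barh cannot broadcast them. Plot the feature importances against X.columns (or list_of_columns) instead of df.columns: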

plt.barh(X.columns, rf.feature_importances_)

[Figure: feature importance bar plot]
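To make the length mismatch concrete, here is a minimal sketch that rebuilds the five sample rows from the question and pairs each importance with its feature name before plotting. The toy frame, the smaller n_estimators, the fixed random_state, and the name importances are assumptions for illustration, not part of the original post.

```python
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.ensemble import RandomForestRegressor

# Toy frame rebuilt from the sample rows in the question (assumption: real data is larger)
df = pd.DataFrame({
    "id": ["abc1", "def2", "ghi3", "jkl4", "mno5"],
    "points": [325, 467, 122, 546, 355],
    "isAvailable": [0, 1, 1, 1, 0],
    "frequency": [93, 80, 90, 84, 93],
    "Score": [0.01, 0.59, 1.0, 0.0, 0.99],
})

list_of_columns = ["points", "isAvailable", "frequency"]
X = df[list_of_columns]
y = df["Score"]

# Small forest with a fixed seed, just to get importances quickly
rf = RandomForestRegressor(n_estimators=10, random_state=12)
rf.fit(X, y)

# df has 5 columns, but the model only saw 3 features
print(len(df.columns), len(rf.feature_importances_))  # prints: 5 3

# Tie each importance to its feature name; a Series raises a ValueError
# right here if the number of names and values ever disagree
importances = pd.Series(rf.feature_importances_, index=X.columns).sort_values()

plt.barh(importances.index, importances.values)
plt.xlabel("feature importance")
```

Building the Series first keeps names and values aligned even after sorting, and it fails early with a clearer message than the broadcast error from barh. If the model was fitted on a DataFrame with a recent scikit-learn (1.0 or later, as far as I know), rf.feature_names_in_ should hold the same names and can be used in place of X.columns.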

Upvotes: 1
