jack_f

Reputation: 140

ValueError: Number of features of the model must match the input

I'm getting this error when trying to predict using a model I built in scikit-learn. I know there are a bunch of questions about this, but mine seems different because the gap between my input features and my model features is so large. Here is my code for training my model (FYI the .csv file has 45 columns, one of which is the known value):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import ensemble
from sklearn.metrics import mean_absolute_error
from sklearn.externals import joblib


df = pd.read_csv("Cinderella.csv")


features_df = pd.get_dummies(df, columns=['Overall_Sentiment', 'Word_1','Word_2','Word_3','Word_4','Word_5','Word_6','Word_7','Word_8','Word_9','Word_10','Word_11','Word_1','Word_12','Word_13','Word_14','Word_15','Word_16','Word_17','Word_18','Word_19','Word_20','Word_21','Word_22','Word_23','Word_24','Word_25','Word_26','Word_27','Word_28','Word_29','Word_30','Word_31','Word_32','Word_33','Word_34','Word_35','Word_36','Word_37','Word_38','Word_39','Word_40','Word_41', 'Word_42', 'Word_43'], dummy_na=True)

del features_df['Slope']

X = features_df.as_matrix()
y = df['Slope'].as_matrix()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

model = ensemble.GradientBoostingRegressor(
    n_estimators=500,
    learning_rate=0.01,
    max_depth=5,
    min_samples_leaf=3,
    max_features=0.1,
    loss='lad'
)

model.fit(X_train, y_train)

joblib.dump(model, 'slope_from_sentiment_model.pkl')

mse = mean_absolute_error(y_train, model.predict(X_train))

print("Training Set Mean Absolute Error: %.4f" % mse)

mse = mean_absolute_error(y_test, model.predict(X_test))
print("Test Set Mean Absolute Error: %.4f" % mse)

Here is my code for the actual prediction using a different .csv file (this has 44 columns because it doesn't include the known values):

from sklearn.externals import joblib
import pandas


model = joblib.load('slope_from_sentiment_model.pkl')

df = pandas.read_csv("Slaughterhouse_copy.csv")


features_df = pandas.get_dummies(df, columns=['Overall_Sentiment','Word_1', 'Word_2', 'Word_3', 'Word_4', 'Word_5', 'Word_6', 'Word_7', 'Word_8', 'Word_9', 'Word_10', 'Word_11', 'Word_12', 'Word_13', 'Word_14', 'Word_15', 'Word_16', 'Word_17','Word_18','Word_19','Word_20','Word_21','Word_22','Word_23','Word_24','Word_25','Word_26','Word_27','Word_28','Word_29','Word_30','Word_31','Word_32','Word_33','Word_34','Word_35','Word_36','Word_37','Word_38','Word_39','Word_40','Word_41','Word_42','Word_43'], dummy_na=True)

predicted_slopes = model.predict(features_df)

When I run the prediction file I get:

ValueError: Number of features of the model must match the input. Model n_features is 146 and input n_features is 226.

If anyone could help me it would be greatly appreciated! Thanks in advance!

Upvotes: 11

Views: 62342

Answers (5)

The number of features in the training data you fit the model on (excluding the labels) must be the same as the number of features in the data you are going to predict on.
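As a rough sketch of one way to enforce this at prediction time: save the training column names and reindex the prediction features to them. This is not part of the question's code; the toy frames and the names `train_columns` and `score_aligned` are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data standing in for the training and prediction files
train = pd.DataFrame({"Word_1": ["the", "dog", "roof"]})
score = pd.DataFrame({"Word_1": ["dog", "cat"]})

# Encode the training data and remember its columns (store them with the model)
train_features = pd.get_dummies(train, columns=["Word_1"])
train_columns = train_features.columns

# Encode the prediction data, then force it onto the training columns:
# columns unseen in training are dropped, missing ones are added as 0
score_features = pd.get_dummies(score, columns=["Word_1"])
score_aligned = score_features.reindex(columns=train_columns, fill_value=0)

print(list(score_aligned.columns) == list(train_columns))  # True
```

Dummy columns for words the model never saw are dropped, and words missing from the new file are added as all-zero columns, so the model receives the same number of features it was fitted on.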

Upvotes: 1

Michael Gardner

Reputation: 1803

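You can use pandas' CategoricalDtype so that values not seen in the training data become null in the test data. The test set then produces the same dummy columns as the training set.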

Input:

import pandas as pd
import numpy as np
from pandas.api.types import CategoricalDtype

# Create Example Data
train = pd.DataFrame({"text":["A", "B", "C", "D", 'F', np.nan]})
test = pd.DataFrame({"text":["D", "D", np.nan,"B", "E", "T"]})

# Convert columns to category dtype and specify categories for test set
train['text'] = train['text'].astype('category')
test['text'] = test['text'].astype(CategoricalDtype(categories=train['text'].cat.categories))

# Create Dummies
pd.get_dummies(test['text'], dummy_na=True)

Output:

| A | B | C | D | F | nan |
|---|---|---|---|---|-----|
| 0 | 0 | 0 | 1 | 0 | 0   |
| 0 | 0 | 0 | 1 | 0 | 0   |
| 0 | 0 | 0 | 0 | 0 | 1   |
| 0 | 1 | 0 | 0 | 0 | 0   |
| 0 | 0 | 0 | 0 | 0 | 1   |
| 0 | 0 | 0 | 0 | 0 | 1   |

Upvotes: 0

code-on-treehouse

Reputation: 11

The following correction to the original answer from Scratch'N'Purr uses integers instead of strings for the newly inserted 'label' column. With string values, get_dummies called without a columns argument would one-hot encode the 'label' column too:

import pandas as pd

train_df = pd.read_csv("Cinderella.csv")
train_df['label'] = 1

score_df = pd.read_csv("Slaughterhouse_copy.csv")
score_df['label'] = 2

# Concat
concat_df = pd.concat([train_df, score_df])

# Create your dummies
features_df = pd.get_dummies(concat_df)

# Split your data
train_df = features_df[features_df['label'] == 1]
score_df = features_df[features_df['label'] == 2]
...

Upvotes: 1

Akson

Reputation: 691

I tried the method suggested here and found that get_dummies one-hot encoded the label column as well, so the DataFrame ends up with 'label_test' and 'label_train' columns instead of 'label'. As a heads-up, split on those columns after calling get_dummies:

# Select rows by the encoded label columns (1 marks membership)
train_df = feature_df[feature_df['label_train'] == 1]
test_df = feature_df[feature_df['label_test'] == 1]
train_df = train_df.drop(['label_train', 'label_test'], axis=1)
test_df = test_df.drop(['label_train', 'label_test'], axis=1)

Upvotes: 4

Scratch'N'Purr

Reputation: 10399

You're getting the error because the columns you pass to get_dummies contain different sets of distinct values in your training file and your scoring file, so the two calls generate different dummy columns.

Let's suppose the Word_1 column in your training set has the following distinct words: the, dog, jumps, roof, off. That's 5 distinct words so pandas will generate 5 features for Word_1. Now, if your scoring dataset has a different number of distinct words in the Word_1 column, then you're going to get a different number of features.
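A minimal sketch of this effect, using made-up toy data rather than the actual CSV files (the frame contents and variable names are assumptions for illustration):

```python
import pandas as pd

# Hypothetical stand-ins for the training and scoring files
train = pd.DataFrame({"Word_1": ["the", "dog", "jumps", "roof", "off"]})
score = pd.DataFrame({"Word_1": ["the", "cat", "sleeps"]})

train_features = pd.get_dummies(train, columns=["Word_1"])
score_features = pd.get_dummies(score, columns=["Word_1"])

# Each call only knows about the distinct values in its own frame
print(train_features.shape[1])  # 5
print(score_features.shape[1])  # 3

# Dummy columns the model has never seen
unseen = sorted(set(score_features.columns) - set(train_features.columns))
print(unseen)  # ['Word_1_cat', 'Word_1_sleeps']
```

The two frames disagree both in the number of columns and in what the columns mean, which is why the mismatch in the question can be so large.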

How to fix:

You'll want to concatenate your training and scoring datasets using concat, apply get_dummies, and then split your datasets. That'll ensure you have captured all the distinct values in your columns. Given that you're using two different csv's, you probably want to generate a column that specifies your training vs scoring dataset.

Example solution:

import pandas as pd

train_df = pd.read_csv("Cinderella.csv")
train_df['label'] = 'train'

score_df = pd.read_csv("Slaughterhouse_copy.csv")
score_df['label'] = 'score'

# Concat
concat_df = pd.concat([train_df , score_df])

# Create your dummies (each column listed once)
features_df = pd.get_dummies(concat_df, columns=['Overall_Sentiment', 'Word_1','Word_2','Word_3','Word_4','Word_5','Word_6','Word_7','Word_8','Word_9','Word_10','Word_11','Word_12','Word_13','Word_14','Word_15','Word_16','Word_17','Word_18','Word_19','Word_20','Word_21','Word_22','Word_23','Word_24','Word_25','Word_26','Word_27','Word_28','Word_29','Word_30','Word_31','Word_32','Word_33','Word_34','Word_35','Word_36','Word_37','Word_38','Word_39','Word_40','Word_41', 'Word_42', 'Word_43'], dummy_na=True)

# Split your data
train_df = features_df[features_df['label'] == 'train']
score_df = features_df[features_df['label'] == 'score']

# Drop your labels
train_df = train_df.drop('label', axis=1)
score_df = score_df.drop('label', axis=1)

# Now delete the 'Slope' column from both frames (it is all NaN in score_df after the concat), create your features matrix, and create your model as you have already shown in your example
...
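To check that the steps above give both halves identical columns, here is a toy run of the same idea. The miniature frames are hypothetical stand-ins for the two CSV files, and `train_X` and `score_X` are names introduced only for this sketch.

```python
import pandas as pd

# Hypothetical miniature versions of the training and scoring files
train_df = pd.DataFrame({"Word_1": ["the", "dog"], "Slope": [0.1, 0.2]})
score_df = pd.DataFrame({"Word_1": ["cat", "dog"]})
train_df["label"] = "train"
score_df["label"] = "score"

# Concatenate first so get_dummies sees every distinct value
concat_df = pd.concat([train_df, score_df], ignore_index=True)
features_df = pd.get_dummies(concat_df, columns=["Word_1"], dummy_na=True)

# Split back apart and drop the helper and target columns from both halves
train_X = features_df[features_df["label"] == "train"].drop(["label", "Slope"], axis=1)
score_X = features_df[features_df["label"] == "score"].drop(["label", "Slope"], axis=1)

print(list(train_X.columns) == list(score_X.columns))  # True
print(train_X.shape, score_X.shape)  # (2, 4) (2, 4)
```

Note that this approach requires the scoring data to be available when the model is trained; if new files arrive later, the model has to be refitted on the combined encoding.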

Upvotes: 24
