The feature names should match those that were passed during fit

Question

I'm trying to calculate the R-squared value after creating a model with sklearn's LinearRegression.

I'm simply:

importing a csv dataset
filtering the interesting columns
splitting the dataset in train and test
creating the model
making a prediction on the test
calculating the R-squared to see how well the model fits the test dataset

The dataset is taken from https://www.kaggle.com/datasets/jeremylarcher/american-house-prices-and-demographics-of-top-cities

The code is as follows:

''' Lets verify if there s a correlation between price and beds number of bathroom'''

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

df = pd.read_csv('data/American_Housing_Data_20231209.csv')

df_interesting_columns = df[['Beds', 'Baths', 'Price']]

independent_variables = df_interesting_columns[['Beds', 'Baths']]
dependent_variable = df_interesting_columns[['Price']]

X_train, X_test, y_train, y_test = train_test_split(independent_variables, dependent_variable, test_size=0.2)

model = LinearRegression()
model.fit(X_train, y_train)

prediction = model.predict(X_test)

print(model.score(y_test, prediction))

But I get the error:

ValueError: The feature names should match those that were passed during fit.
Feature names unseen at fit time:
- Price
Feature names seen at fit time, yet now missing:
- Baths
- Beds

What am I doing wrong?

Corralien · Accepted Answer

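Your last line is wrong: you misunderstood the score method. score takes X and y as parameters, not y_true and y_pred. It calls predict(X) internally and compares the result with y, so passing y_test as the first argument makes the model try to predict from the Price column, which triggers the feature-name error.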

Try:

from sklearn.metrics import r2_score

print(r2_score(y_test, prediction))
# 0.24499127100887863

Or with the score method:

print(model.score(X_test, y_test))
# 0.24499127100887863
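
To see that the two calls above are equivalent, here is a minimal self-contained sketch. It assumes a small synthetic dataset in place of the Kaggle CSV (the coefficients, noise level, and random_state values are made up for illustration), so the scores it prints will differ from the number above. It also shows that the original call, model.score(y_test, prediction), raises a ValueError.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data (assumption: values are invented)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Beds": rng.integers(1, 6, size=200),
    "Baths": rng.integers(1, 4, size=200),
})
df["Price"] = (
    50_000 * df["Beds"] + 30_000 * df["Baths"] + rng.normal(0, 20_000, size=200)
)

X = df[["Beds", "Baths"]]
y = df[["Price"]]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
prediction = model.predict(X_test)

# score(X, y) predicts from X internally, then computes R-squared against y
score_from_method = model.score(X_test, y_test)
# r2_score(y_true, y_pred) takes the already-computed predictions
score_from_metric = r2_score(y_test, prediction)
print(score_from_method, score_from_metric)

# The call from the question passes the target where the features belong
try:
    model.score(y_test, prediction)
    wrong_call_error = None
except ValueError as exc:
    wrong_call_error = str(exc)
print(wrong_call_error)
```

Both printed scores should be identical, since score is a shortcut for predicting and then calling r2_score.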
