Fede

Reputation: 13

The feature names should match those that were passed during fit

I'm trying to calculate the R-squared value after creating a model with sklearn's linear regression.

I'm simply:

  1. importing a csv dataset
  2. filtering the interesting columns
  3. splitting the dataset into train and test sets
  4. creating the model
  5. making a prediction on the test set
  6. calculating the R-squared to see how well the model fits the test dataset

The dataset is taken from https://www.kaggle.com/datasets/jeremylarcher/american-house-prices-and-demographics-of-top-cities

The code is as follows:

"""Let's verify whether there is a correlation between price and the number of beds and bathrooms."""

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

df = pd.read_csv('data/American_Housing_Data_20231209.csv')

df_interesting_columns = df[['Beds', 'Baths', 'Price']]

independent_variables = df_interesting_columns[['Beds', 'Baths']]
dependent_variable = df_interesting_columns[['Price']]

X_train, X_test, y_train, y_test = train_test_split(independent_variables, dependent_variable, test_size=0.2)

model = LinearRegression()
model.fit(X_train, y_train)

prediction = model.predict(X_test)

print(model.score(y_test, prediction))

But I get the error:

ValueError: The feature names should match those that were passed during fit. Feature names unseen at fit time:

What am I doing wrong?

Upvotes: 1

Views: 7146

Answers (2)

tyatyaboy21

Reputation: 23

The mismatch comes from the format of the data you pass in versus the format of the output. Usually we pass a DataFrame and get back a NumPy array, which no longer carries the column names of the input. So you need to make sure the output is in the same format as the input. Apply the following sklearn configuration before training and testing, and check whether it solves the problem.

import sklearn
sklearn.set_config(transform_output="pandas")

Upvotes: 0

Corralien

Reputation: 120509

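Your last line is wrong: you misunderstood the score method. score takes X and y as parameters, not y_true and y_pred. It calls predict(X) internally and compares the result with y, so passing y_test as X hands the model a Price column it never saw during fit, hence the feature names error.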

Try:

from sklearn.metrics import r2_score

print(r2_score(y_test, prediction))
# 0.24499127100887863

Or with the score method:

print(model.score(X_test, y_test))
# 0.24499127100887863
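
To see that the two forms are equivalent, here is a minimal sketch on synthetic data. The Beds, Baths and Price values below are made up for illustration and are not the Kaggle dataset; the random_state values are also my own assumption, added only to make the run repeatable.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data (made-up values, not the real CSV).
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "Beds": rng.integers(1, 6, size=200),
    "Baths": rng.integers(1, 4, size=200),
})
noise = rng.normal(0, 20_000, size=200)
y = (50_000 * X["Beds"] + 30_000 * X["Baths"] + noise).rename("Price")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LinearRegression().fit(X_train, y_train)

# score(X, y): predicts from X internally, then compares with y.
r2_from_score = model.score(X_test, y_test)

# Two-step form: predict first, then compare true and predicted targets.
r2_from_metric = r2_score(y_test, model.predict(X_test))

print(r2_from_score, r2_from_metric)
```

Both numbers should agree, since score is a shortcut for predicting on X and then computing R-squared against y.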

Upvotes: 0
