Reputation: 13
I'm trying to calculate the R-squared value after fitting a model with sklearn's LinearRegression.
The dataset is taken from https://www.kaggle.com/datasets/jeremylarcher/american-house-prices-and-demographics-of-top-cities
The code is as follows:
"""Let's verify whether there is a correlation between price and the number of beds and bathrooms."""
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
df = pd.read_csv('data/American_Housing_Data_20231209.csv')
df_interesting_columns = df[['Beds', 'Baths', 'Price']]
independent_variables = df_interesting_columns[['Beds', 'Baths']]
dependent_variable = df_interesting_columns[['Price']]
X_train, X_test, y_train, y_test = train_test_split(independent_variables, dependent_variable, test_size=0.2)
model = LinearRegression()
model.fit(X_train, y_train)
prediction = model.predict(X_test)
print(model.score(y_test, prediction))
But I get the error:
ValueError: The feature names should match those that were passed during fit. Feature names unseen at fit time:
What am I doing wrong?
Upvotes: 1
Views: 7146
Reputation: 23
The mismatch comes from the difference between the format of the data you pass in and the format of the output. Usually we pass a DataFrame and get back a NumPy array, which does not carry the column names of the input. So you need to make sure the output is in the same format as the input. Try applying the following sklearn configuration before the training and testing calls, and check whether it solves the problem.
import sklearn
sklearn.set_config(transform_output="pandas")
Upvotes: 0
Reputation: 120509
Your last line is wrong. You misunderstood the score method: score takes X and y as parameters, not y_true and y_pred.
Try:
from sklearn.metrics import r2_score
print(r2_score(y_test, prediction))
# 0.24499127100887863
Or with the score method:
print(model.score(X_test, y_test))
# 0.24499127100887863
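To see both points side by side, here is a self-contained sketch on synthetic data. The values are made up and stand in for the Kaggle file, and random_state is an added assumption for repeatability. It reproduces the ValueError from the question and then checks that model.score(X_test, y_test) and r2_score(y_test, prediction) agree.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data (made-up values).
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "Beds": rng.integers(1, 6, size=n),
    "Baths": rng.integers(1, 4, size=n),
})
noise = rng.normal(0, 20_000, size=n)
y = pd.DataFrame({"Price": 50_000 * X["Beds"] + 30_000 * X["Baths"] + noise})

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
model = LinearRegression().fit(X_train, y_train)
prediction = model.predict(X_test)

# Wrong call: score treats its first argument as X, so it tries to predict
# from a frame whose only column is 'Price', which was never seen during fit.
try:
    model.score(y_test, prediction)
    raised = False
except ValueError as exc:
    raised = True
    print("ValueError:", str(exc).splitlines()[0])

# Correct calls: both compute R-squared of the predictions against y_test.
score_via_method = model.score(X_test, y_test)
score_via_metric = r2_score(y_test, prediction)
print(score_via_method, score_via_metric)
```

The score method calls predict on X_test internally, so there is no need to compute the predictions first when you only want the R-squared value.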
Upvotes: 0