random student
random student

Reputation: 775

Confidence score too low

Im wondering why the model score is very low, only 0.13, i already make sure the data is clean, scaled, and also have high correlation between each features but the model score using linear regression is very low, why is this happening and how to solve this? this is my code

import numpy as np 
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import preprocessing


path = r"D:\python projects\avocado.csv"
df = pd.read_csv(path)
df = df.reset_index(drop=True)
df.set_index('Date', inplace=True)
df = df.drop(['Unnamed: 0','year','type','region','AveragePrice'],1)
df.rename(columns={'4046':'Small HASS sold',
                          '4225':'Large HASS sold',
                          '4770':'XLarge HASS sold'}, 
                 inplace=True)
print(df.head)

sns.heatmap(df.corr())
sns.pairplot(df)
df.plot()
_=plt.xticks(rotation=20)

forecast_line = 35
df['target'] = df['Total Volume'].shift(-forecast_line)

X = np.array(df.drop(['target'], 1))
X = preprocessing.scale(X)
X_lately = X[-forecast_line:]
X = X[:-forecast_line]
df.dropna(inplace=True)


y = np.array(df['target'])

X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.2)
lr = LinearRegression()
lr.fit(X_train,y_train)
confidence = lr.score(X_test,y_test)
print(confidence)

this is the link of the dataset i use

https://www.kaggle.com/neuromusic/avocado-prices

Upvotes: 1

Views: 143

Answers (1)

PV8
PV8

Reputation: 6260

So the score function you are using is:

Return the coefficient of determination R^2 of the prediction.

The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) ** 2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get a R^2 score of 0.0.

So as you realise you are already above the the constant prediction.

My advice try to plot your data, to see what kind of regression you should use. Here you can see an overview which type of linear regression are available: https://scikit-learn.org/stable/modules/linear_model.html

Logistic regression makes sense if your data has a logistic curve, which means that your points are either close to 0 or to 1, and in the middle are not so many points.

Upvotes: 1

Related Questions