Mike

Reputation: 643

LinearRegression in Python giving incorrect results?

I have a comma-separated CSV file with two numerical columns: inputs and outputs. They are correlated in a more or less linear way (see the data below). The sample I have is very small.

Below is the Python code I wrote using sklearn to predict values. Somehow it's not giving me reasonable predictions. I am quite new to this, so please bear with me.

import pandas as pd

data = pd.read_csv("data.csv", header=None, names=['kg', 'cm'])
labels = data['kg']
train1 = data.drop(['kg'], axis=1) # In all honesty, I don't understand this.

from sklearn.cross_validation import train_test_split

x_train, x_test, y_train, y_test = train_test_split(train1, labels, test_size=0.10, random_state=2)

from sklearn.linear_model import LinearRegression

reg = LinearRegression()
reg.fit(x_train, y_train)
reg.score(x_test, y_test)
reg.predict(80) # Gives an incorrect value of about 108.

Data.

89,155
86,161
82.5,168
79.25,174
76.25,182
73,189
70,198
66.66,207
63.5,218
60.25,229
57,241
54,257
51,259

Upvotes: 0

Views: 1046

Answers (3)

David

Reputation: 775

from io import StringIO
input_data = StringIO("""89,155
86,161
82.5,168
79.25,174
76.25,182
73,189
70,198
66.66,207
63.5,218
60.25,229
57,241
54,257
51,259""")


import pandas as pd

data = pd.read_csv(input_data, header=None, names=['kg', 'cm'])
labels = data['cm']
train1 = data.drop(['cm'], axis=1)  # Drops 'cm', so only the 'kg' column is left as the feature

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(train1, labels, test_size=0.10, random_state=2)

from sklearn.linear_model import LinearRegression

reg = LinearRegression()
reg.fit(x_train, y_train)
reg.score(x_test, y_test)

import numpy as np
reg.predict(np.array([80]).reshape(-1, 1)) # 172.65013306.

Upvotes: 1

Shubham Sharma

Reputation: 1831

Actually, the problem is that you have misread what your own code does.

import pandas as pd

data = pd.read_csv("data.csv", header=None, names=['kg', 'cm'])
labels = data['kg']
train1 = data.drop(['kg'], axis=1) # In all honesty, I don't understand this.

Up to this point you have loaded the dataframe and then separated X and y from the dataset.

labels represents the y values.
train1 represents the x values.

Since you wrote that you don't understand train1 = data.drop(['kg'], axis=1), let me explain it.
The dataframe consists of both columns, 'kg' and 'cm'. This call removes the 'kg' column (axis=1 means column, axis=0 means row). Hence only 'cm' remains, which is your x.
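To make the effect of axis=1 concrete, here is a minimal sketch. It assumes a tiny inline DataFrame built from the first two rows of the question's data, standing in for data.csv.

```python
import pandas as pd

# Two rows copied from the question's data, with the question's column names.
# The inline DataFrame is a stand-in for reading data.csv.
data = pd.DataFrame({"kg": [89.0, 86.0], "cm": [155, 161]})

labels = data["kg"]                 # a Series: the column used as y
train1 = data.drop(["kg"], axis=1)  # a DataFrame with the 'kg' column removed

remaining_columns = list(train1.columns)
print(remaining_columns)  # ['cm']
print(train1.shape)       # (2, 1)
```

So with this code, 'cm' is the feature and 'kg' is the target, whatever the intent was.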

from sklearn.cross_validation import train_test_split

x_train, x_test, y_train, y_test = train_test_split(train1, labels, test_size=0.10, random_state=2)

from sklearn.linear_model import LinearRegression

reg = LinearRegression()
reg.fit(x_train, y_train)
reg.score(x_test, y_test)
reg.predict(80) # Gives an incorrect value of about 108.

Now you train the model with x values that represent 'cm' and y values that represent 'kg'.

When you call predict(80), you are passing 80 as the 'cm' value. Let me plot 'cm' vs 'kg' for the training data.

[Image: scatter plot of 'cm' (x-axis) against 'kg' (y-axis) for the training data]

When you input a height of 80, you are far to the left of every point in the plot. As the plot shows, y increases as x decreases, which means 'kg' increases as 'cm' decreases. Hence the output is about 110, which is higher than any 'kg' value in your data.
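The extrapolation can be checked numerically. The sketch below is not the code from the question: it assumes a fit on all 13 rows with no train/test split, and it uses np.polyfit as a stand-in for LinearRegression (both perform ordinary least squares for a single feature), so the numbers will differ a little from those quoted above.

```python
import numpy as np

# Data copied from the question: first column 'kg', second column 'cm'.
kg = np.array([89, 86, 82.5, 79.25, 76.25, 73, 70, 66.66, 63.5, 60.25, 57, 54, 51])
cm = np.array([155, 161, 168, 174, 182, 189, 198, 207, 218, 229, 241, 257, 259])

# Direction used in the question: x = 'cm', y = 'kg'.
# np.polyfit with degree 1 returns (slope, intercept).
slope_kg, intercept_kg = np.polyfit(cm, kg, 1)
kg_at_80cm = slope_kg * 80 + intercept_kg  # 80 is far below the smallest 'cm' (155)

# Swapped direction: x = 'kg', y = 'cm'.
slope_cm, intercept_cm = np.polyfit(kg, cm, 1)
cm_at_80kg = slope_cm * 80 + intercept_cm  # 80 lies inside the observed 'kg' range

print(kg_at_80cm)
print(cm_at_80kg)
```

Under these assumptions the first value comes out above every 'kg' in the data, while the second falls inside the observed 'cm' range. Swapping labels and features is therefore what makes predict(80) behave as a weight of 80 kg.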

Upvotes: 2

Aditya Lahiri

Reputation: 419

I think you are having problems because of the small data size. The code flow looks normal to me. I would suggest you find the p-value for the input-output relationship. This will tell you whether the correlation found by your linear regression is significant (p-value < 0.05).

You can find the p-value using:

 from scipy.stats import linregress
 # data is the DataFrame from the question, with columns 'kg' and 'cm'
 print(linregress(data['kg'], data['cm']))

scikit-learn's LinearRegression does not report a p-value, so to get one there you would probably need to compute it from the formula yourself. Good luck.

Upvotes: -1
