Reputation: 4333
I am doing linear regression with multiple variables. To get the thetas (coefficients) I used NumPy's least-squares tool, numpy.linalg.lstsq. My data has n = 143 features and m = 13000 training examples. I want to plot house prices against area and show the fitted line for this feature.
Data preparation code (Python):
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
path = 'DB2.csv'
data = pd.read_csv(path, header=None, delimiter=";")
data.insert(0, 'Ones', 1)
cols = data.shape[1]
X = data.iloc[:,0:cols-1]
y = data.iloc[:,cols-1:cols]
Getting theta coefficients with numpy.linalg.lstsq:
thetas = np.linalg.lstsq(X, y, rcond=None)[0]  # rcond=None avoids the FutureWarning in recent NumPy
Prediction part:
allAreasData = X.iloc[:,120] #Used as argument to scatter all training data
areasTestValues = X.iloc[0:100,120] #Used as argument for plot function
testingExamples = X.iloc[0:100,:] #Used to make predictions
predictions = testingExamples.dot(thetas)
Note: 120 in the above code is the index of the Area column in my dataset.
Visualization part:
fig, ax = plt.subplots(figsize=(18,10))
ax.scatter(allAreasData, y, label='Training Data', color='r')
ax.plot(areasTestValues, predictions, 'b', label='Prediction')
ax.legend(loc=2)
ax.set_xlabel('Area')
ax.set_ylabel('Price')
ax.set_title('Predicted Price vs. House Area')
I expected to get a single regression line that fits the data, but instead I got a strange polyline (broken line). What am I doing wrong? The scatter works correctly, but the plot does not. I pass 2 arguments to the plot function:
1) Testing area data (100 area values)
2) Price predictions for the same 100 training examples, which include the area data
Update:
After sorting x, I got a plot with a curve.
I was expecting a straight line fitting all my data with least-squares error, but instead I got a curve. Aren't linear regression and numpy.linalg.lstsq supposed to return a straight fitted line rather than a curve?
Upvotes: 1
Views: 1190
Reputation: 1480
Your result is linear in a 143-dimensional space. ;) Since your X contains many more features than just the area, the prediction also depends (linearly) on those other features, so plotting it against the area alone does not give a straight line.
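To make that concrete, here is a minimal sketch on synthetic data (not your DB2.csv; the `rooms` feature, the coefficients and the sample size are made up for illustration). It fits two features plus an intercept, then checks how far the predictions are from a straight line in the area alone, and again after the second feature is held fixed at its mean.

```python
import numpy as np

# Synthetic stand-in for the real dataset (all values are assumptions).
rng = np.random.default_rng(0)
m = 200
area = rng.uniform(50, 200, size=m)
rooms = rng.integers(1, 6, size=m).astype(float)  # hypothetical second feature

X = np.column_stack([np.ones(m), area, rooms])    # intercept, area, rooms
y = 10.0 + 2.0 * area + 15.0 * rooms + rng.normal(0, 1, size=m)

thetas, *_ = np.linalg.lstsq(X, y, rcond=None)
predictions = X @ thetas

# Best straight line through (area, prediction) and the worst deviation from it.
slope, intercept = np.polyfit(area, predictions, 1)
deviation = np.abs(predictions - (slope * area + intercept)).max()

# Hold the other feature at its mean: now only the area varies.
X_fixed = X.copy()
X_fixed[:, 2] = rooms.mean()
pred_fixed = X_fixed @ thetas
slope_f, intercept_f = np.polyfit(area, pred_fixed, 1)
deviation_fixed = np.abs(pred_fixed - (slope_f * area + intercept_f)).max()

print("deviation with all features varying:", deviation)
print("deviation with rooms held fixed:    ", deviation_fixed)
```

The first deviation is large because `rooms` moves the prediction up and down independently of the area; the second is zero up to rounding, and its slope is just the area coefficient. Holding the other features fixed is one way to draw a straight line from the full model without retraining.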
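If you redo your training with X = data.iloc[:, [0, 120]] (only the column of ones and the area feature; lstsq needs a 2-D X, so a single column selected as data.iloc[:,120] would have to be reshaped first), you should get a straight line when you plot the results.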
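A rough sketch of that refit, again on synthetic data rather than your file (the `other` feature and all numbers are assumptions). The design matrix keeps a column of ones so the line has an intercept, and the points are sorted by area before plotting so the line is drawn left to right.

```python
import numpy as np

# Synthetic stand-in for the real dataset (all values are assumptions).
rng = np.random.default_rng(1)
m = 200
area = rng.uniform(50, 200, size=m)
other = rng.normal(size=m)                      # hypothetical extra feature
price = 10.0 + 2.0 * area + 15.0 * other + rng.normal(0, 1, size=m)

# Train on the intercept column and the area only; X must be 2-D for lstsq.
X_area = np.column_stack([np.ones(m), area])
theta_area, *_ = np.linalg.lstsq(X_area, price, rcond=None)

# Sort by area so a line plot connects the points in x order.
order = np.argsort(area)
line_x = area[order]
line_y = X_area[order] @ theta_area

print("intercept, slope:", theta_area)
```

Passing `line_x` and `line_y` to `ax.plot` should then draw a single straight segment over the scatter, since every predicted value is `theta_area[0] + theta_area[1] * area`.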
Upvotes: 2