Reputation: 61
I need to run linear regression on the Boston housing dataset without using scikit-learn.
This is what I've come up with so far:
import pandas as pd
import numpy as np
import matplotlib.pyplot as mlt
from sklearn.cross_validation import train_test_split
data = pd.read_csv("housing.csv", delimiter=' ',
                   skipinitialspace=True,
                   names=['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE',
                          'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
                   )
df_x = data.drop('MEDV', axis = 1)
df_y = data['MEDV']
x_train, x_test, y_train, y_test = train_test_split(df_x, df_y,
                                                    test_size=0.2,
                                                    random_state=4
                                                    )
def hypothesis(x, theta):
    return np.dot(x, theta.T)
def costfn(predictions, y, x):
    a = 1 / (2 * len(x)) * np.sum((predictions - y) ** 2)
    return a
def gradient(theta, alpha, predictions, x, y):
    theta = np.subtract(theta, (alpha / len(x)) * np.dot(np.subtract(predictions, y).T, x))
    return theta
alpha = 0.001
iters = 1000
theta = np.zeros([1, 13])
predictions = hypothesis(x_train, theta)
for i in range(iters):
    predictions = hypothesis(x_train, theta)
    theta = gradient(theta, alpha, predictions, x_train, y_train)
predictions = hypothesis(x_test, theta)
print(predictions)
I've loaded the input and split it into training and test sets, and all of that is working fine. But I'm getting this error:
Exception Traceback (most recent call last)
<ipython-input-33-36492e2820ce> in <module>
6 for i in range(iters):
7 predictions = hypothesis(x_train, theta)
----> 8 theta = gradient(theta, alpha, predictions, x_train, y_train)
9
10 predictions = hypothesis(x_test, theta)
<ipython-input-32-15d0b5b7bf16> in gradient(theta, alpha, predictions, x, y)
9
10
---> 11 theta = np.subtract(theta, (alpha / len(x)) * np.dot(np.subtract(predictions, y).T, x))
12 return theta
/usr/lib/python3/dist-packages/pandas/core/series.py in __array_wrap__(self, result, context)
502 """
503 return self._constructor(result, index=self.index,
--> 504 copy=False).__finalize__(self)
505
506 def __array_prepare__(self, result, context=None):
/usr/lib/python3/dist-packages/pandas/core/series.py in __init__(self, data, index, dtype, name, copy, fastpath)
262 else:
263 data = _sanitize_array(data, index, dtype, copy,
--> 264 raise_cast_failure=True)
265
266 data = SingleBlockManager(data, index, fastpath=True)
/usr/lib/python3/dist-packages/pandas/core/series.py in _sanitize_array(data, index, dtype, copy, raise_cast_failure)
3273 elif subarr.ndim > 1:
3274 if isinstance(data, np.ndarray):
-> 3275 raise Exception('Data must be 1-dimensional')
3276 else:
3277 subarr = _asarray_tuplesafe(data, dtype=dtype)
Exception: Data must be 1-dimensional
Please help. Also, if my logic is wrong please tell me as I'm a beginner.
Upvotes: 2
Views: 1004
Reputation: 8112
pandas is great for data management, but I tend to stick to NumPy objects for the mathematical steps. pandas is trying to do something clever here, I don't know what, but if you pass df_x.values and df_y.values to train_test_split(), your code runs:
x_train, x_test, y_train, y_test = train_test_split(df_x.values,
df_y.values,
test_size=0.2,
random_state=4
)
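One thing worth checking once the code runs is the array shapes. With theta of shape (1, 13), hypothesis() returns a column of shape (n, 1), and subtracting a 1-D target of shape (n,) from it broadcasts to an (n, n) matrix under NumPy's rules, so the update is probably not the one you intended. Below is a minimal sketch of the same gradient descent with a 1-D theta, which keeps predictions and residuals 1-D. It runs on synthetic stand-in data rather than housing.csv, and the names cost and gradient_step, the data, and alpha = 0.1 are my own assumptions for illustration.

```python
import numpy as np

# Shape check: a column vector minus a 1-D vector broadcasts to a square matrix.
n = 5
col = np.zeros((n, 1))      # shape of np.dot(x, theta.T) when theta is (1, n_features)
vec = np.zeros(n)           # shape of a 1-D target array
print((col - vec).shape)    # (5, 5), not (5,)


def hypothesis(x, theta):
    """Predictions of shape (n_samples,) for x of shape (n_samples, n_features)."""
    return x @ theta


def cost(theta, x, y):
    """Half the mean squared error of the current predictions."""
    residuals = hypothesis(x, theta) - y
    return np.sum(residuals ** 2) / (2 * len(x))


def gradient_step(theta, alpha, x, y):
    """One batch gradient descent update; theta keeps shape (n_features,)."""
    residuals = hypothesis(x, theta) - y
    return theta - (alpha / len(x)) * (x.T @ residuals)


# Synthetic stand-in data (an assumption, not the housing file).
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 3))
true_theta = np.array([2.0, -1.0, 0.5])
y = x @ true_theta + 0.01 * rng.normal(size=200)

theta = np.zeros(x.shape[1])    # 1-D, so predictions come out 1-D as well
alpha = 0.1                     # step size chosen for this toy data only
for _ in range(1000):
    theta = gradient_step(theta, alpha, x, y)

print(theta.shape)
print(cost(theta, x, y))
```

On the real data the features sit on very different scales, so you may need to standardize the columns or use a much smaller alpha before the loop converges. Like your original code, this sketch has no intercept term.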
Upvotes: 1