Reputation: 347
I am trying to create a multiple regression model using scikit-learn's linear_model module. All the examples I could find online use pandas DataFrames to load variables into the model, but I am trying to use NumPy arrays, which leads to the error described step by step below.
The model is y = a0 + a1*x1 + a2*x2.
The independent variables x1 and x2 are one-dimensional arrays with 36 values each:
x1 = [ 790, 1160, 929, 865, 1140, 929, 1109, 1365, 1112, 1150, 980,
990, 1112, 1252, 1326, 1330, 1365, 1280, 1119, 1328, 1584, 1428,
1365, 1415, 1415, 1465, 1490, 1725, 1523, 1705, 1605, 1746, 1235,
1390, 1405, 1395]
x2 = [1000, 1200, 1000, 900, 1500, 1000, 1400, 1500, 1500, 1600, 1100,
1300, 1000, 1600, 1600, 1600, 1600, 2200, 1600, 2000, 1600, 2000,
2100, 1600, 2000, 1500, 2000, 2000, 1600, 2000, 2100, 2000, 1600,
1600, 1600, 2500]
Combining the independent variables into one numpy array:
X = np.array([x1, x2])
X = array([[ 790, 1160, 929, 865, 1140, 929, 1109, 1365, 1112, 1150, 980,
990, 1112, 1252, 1326, 1330, 1365, 1280, 1119, 1328, 1584, 1428,
1365, 1415, 1415, 1465, 1490, 1725, 1523, 1705, 1605, 1746, 1235,
1390, 1405, 1395],
[1000, 1200, 1000, 900, 1500, 1000, 1400, 1500, 1500, 1600, 1100,
1300, 1000, 1600, 1600, 1600, 1600, 2200, 1600, 2000, 1600, 2000,
2100, 1600, 2000, 1500, 2000, 2000, 1600, 2000, 2100, 2000, 1600,
1600, 1600, 2500]], dtype=int64)
The target variable:
y = array([ 99, 95, 95, 90, 105, 105, 90, 92, 98, 99, 99, 101, 99,
94, 97, 97, 99, 104, 104, 105, 94, 99, 99, 99, 99, 102,
104, 114, 109, 114, 115, 117, 104, 108, 109, 120], dtype=int64)
Fitting the model:
regr = linear_model.LinearRegression()
regr.fit(X, y)
This generates the following error. Why?
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-11-5d359e69e27d> in <module>
1 regr = linear_model.LinearRegression()
----> 2 regr.fit(X, y)
3
4 #predict the CO2 emission of a car where the weight is 2300kg, and the volume is 1300cm3:
5 predictedCO2 = regr.predict([[3300, 1300]])
c:\users\donald seger\miniconda3\envs\tensorflow\lib\site-packages\sklearn\linear_model\_base.py in fit(self, X, y, sample_weight)
503
504 n_jobs_ = self.n_jobs
--> 505 X, y = self._validate_data(X, y, accept_sparse=['csr', 'csc', 'coo'],
506 y_numeric=True, multi_output=True)
507
c:\users\donald seger\miniconda3\envs\tensorflow\lib\site-packages\sklearn\base.py in _validate_data(self, X, y, reset, validate_separately, **check_params)
430 y = check_array(y, **check_y_params)
431 else:
--> 432 X, y = check_X_y(X, y, **check_params)
433 out = X, y
434
c:\users\donald seger\miniconda3\envs\tensorflow\lib\site-packages\sklearn\utils\validation.py in inner_f(*args, **kwargs)
70 FutureWarning)
71 kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 72 return f(**kwargs)
73 return inner_f
74
c:\users\donald seger\miniconda3\envs\tensorflow\lib\site-packages\sklearn\utils\validation.py in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, estimator)
810 y = y.astype(np.float64)
811
--> 812 check_consistent_length(X, y)
813
814 return X, y
c:\users\donald seger\miniconda3\envs\tensorflow\lib\site-packages\sklearn\utils\validation.py in check_consistent_length(*arrays)
253 uniques = np.unique(lengths)
254 if len(uniques) > 1:
--> 255 raise ValueError("Found input variables with inconsistent numbers of"
256 " samples: %r" % [int(l) for l in lengths])
257
ValueError: Found input variables with inconsistent numbers of samples: [2, 36]
Upvotes: 0
Views: 101
Reputation: 3455
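Indeed, the rows and columns are swapped, as @BenReiniger noticed in a comment. scikit-learn expects X to have shape (n_samples, n_features), but np.array([x1, x2]) has shape (2, 36), so fit sees 2 samples in X against 36 values in y.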
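A quick shape check makes the mismatch visible. The sketch below uses shortened stand-in lists (only the first four values of each variable, not the full data) and relies on scikit-learn reading the first axis of X as samples and the second as features:

```python
import numpy as np

# Shortened stand-ins for the question's data (first four values only).
x1 = [790, 1160, 929, 865]
x2 = [1000, 1200, 1000, 900]
y = np.array([99, 95, 95, 90])

X_wrong = np.array([x1, x2])  # one row per variable
X_right = X_wrong.T           # one row per sample

print(X_wrong.shape, y.shape)  # (2, 4) (4,)
print(X_right.shape, y.shape)  # (4, 2) (4,)
```

Only the second layout has as many rows in X as there are entries in y.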
I tested your code, and transposing X with .T solves the problem:
>>> import numpy as np
>>> import sklearn.linear_model
>>> x1 = [790, 1160, 929, 865, 1140, 929, 1109, 1365, 1112, 1150, 980,
... 990, 1112, 1252, 1326, 1330, 1365, 1280, 1119, 1328, 1584, 1428,
... 1365, 1415, 1415, 1465, 1490, 1725, 1523, 1705, 1605, 1746, 1235,
... 1390, 1405, 1395]
>>> x2 = [1000, 1200, 1000, 900, 1500, 1000, 1400, 1500, 1500, 1600, 1100,
... 1300, 1000, 1600, 1600, 1600, 1600, 2200, 1600, 2000, 1600, 2000,
... 2100, 1600, 2000, 1500, 2000, 2000, 1600, 2000, 2100, 2000, 1600,
... 1600, 1600, 2500]
>>> X = np.array([x1, x2]).T
>>> y = np.array([99, 95, 95, 90, 105, 105, 90, 92, 98, 99, 99, 101, 99,
... 94, 97, 97, 99, 104, 104, 105, 94, 99, 99, 99, 99, 102,
... 104, 114, 109, 114, 115, 117, 104, 108, 109, 120])
>>> regr = sklearn.linear_model.LinearRegression()
>>> regr.fit(X, y)
LinearRegression()
fit returns the fitted LinearRegression() object, as expected.
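If the transpose feels easy to forget, np.column_stack may be a more explicit way to build X, since it puts each 1-D input in its own column. The following is a minimal sketch, not the only way to do it. It again uses shortened stand-in data (the first six values only), so the fitted numbers are illustrative, and the query point of 2300 and 1300 is taken from the comment in the question's traceback:

```python
import numpy as np
from sklearn import linear_model

# Shortened stand-ins for the question's data (first six values only).
x1 = [790, 1160, 929, 865, 1140, 929]
x2 = [1000, 1200, 1000, 900, 1500, 1000]
y = np.array([99, 95, 95, 90, 105, 105])

# column_stack places each 1-D input in its own column: shape (6, 2).
X = np.column_stack([x1, x2])

regr = linear_model.LinearRegression()
regr.fit(X, y)

print(regr.intercept_)  # a0
print(regr.coef_)       # [a1, a2]

# predict also expects 2-D input: one row per sample, one column per feature.
predicted = regr.predict(np.array([[2300, 1300]]))
print(predicted)
```

Note that predict takes the same (n_samples, n_features) layout as fit, which is why the single query point is wrapped in a nested list.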
Upvotes: 1