Alex

Reputation: 347

Why does scikit-learn's Multiple Regression method generate an error if the input is a numpy array?

I am trying to create a Multiple Regression model using scikit-learn's linear_model module. All the examples I could find online load the variables into the model from pandas DataFrames. I am trying to use numpy arrays instead, which leads to the error described step by step below.

The model is y = a0 + a1*x1 + a2*x2.

The independent variables x1 and x2 are one-dimensional arrays with 36 values each:

x1 = [ 790, 1160,  929,  865, 1140,  929, 1109, 1365, 1112, 1150,  980,
         990, 1112, 1252, 1326, 1330, 1365, 1280, 1119, 1328, 1584, 1428,
        1365, 1415, 1415, 1465, 1490, 1725, 1523, 1705, 1605, 1746, 1235,
        1390, 1405, 1395]

x2 = [1000, 1200, 1000,  900, 1500, 1000, 1400, 1500, 1500, 1600, 1100,
        1300, 1000, 1600, 1600, 1600, 1600, 2200, 1600, 2000, 1600, 2000,
        2100, 1600, 2000, 1500, 2000, 2000, 1600, 2000, 2100, 2000, 1600,
        1600, 1600, 2500]

Combining the independent variables into one numpy array:
X = np.array([x1, x2])

X = array([[ 790, 1160,  929,  865, 1140,  929, 1109, 1365, 1112, 1150,  980,
         990, 1112, 1252, 1326, 1330, 1365, 1280, 1119, 1328, 1584, 1428,
        1365, 1415, 1415, 1465, 1490, 1725, 1523, 1705, 1605, 1746, 1235,
        1390, 1405, 1395],
       [1000, 1200, 1000,  900, 1500, 1000, 1400, 1500, 1500, 1600, 1100,
        1300, 1000, 1600, 1600, 1600, 1600, 2200, 1600, 2000, 1600, 2000,
        2100, 1600, 2000, 1500, 2000, 2000, 1600, 2000, 2100, 2000, 1600,
        1600, 1600, 2500]], dtype=int64)

The target variable:
y = array([ 99,  95,  95,  90, 105, 105,  90,  92,  98,  99,  99, 101,  99,
        94,  97,  97,  99, 104, 104, 105,  94,  99,  99,  99,  99, 102,
       104, 114, 109, 114, 115, 117, 104, 108, 109, 120], dtype=int64)

Fitting the model:
regr = linear_model.LinearRegression()
regr.fit(X, y)

This generates the following error. Why?

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-11-5d359e69e27d> in <module>
      1 regr = linear_model.LinearRegression()
----> 2 regr.fit(X, y)
      3 
      4 #predict the CO2 emission of a car where the weight is 2300kg, and the volume is 1300cm3:
      5 predictedCO2 = regr.predict([[3300, 1300]])

c:\users\donald seger\miniconda3\envs\tensorflow\lib\site-packages\sklearn\linear_model\_base.py in fit(self, X, y, sample_weight)
    503 
    504         n_jobs_ = self.n_jobs
--> 505         X, y = self._validate_data(X, y, accept_sparse=['csr', 'csc', 'coo'],
    506                                    y_numeric=True, multi_output=True)
    507 

c:\users\donald seger\miniconda3\envs\tensorflow\lib\site-packages\sklearn\base.py in _validate_data(self, X, y, reset, validate_separately, **check_params)
    430                 y = check_array(y, **check_y_params)
    431             else:
--> 432                 X, y = check_X_y(X, y, **check_params)
    433             out = X, y
    434 

c:\users\donald seger\miniconda3\envs\tensorflow\lib\site-packages\sklearn\utils\validation.py in inner_f(*args, **kwargs)
     70                           FutureWarning)
     71         kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 72         return f(**kwargs)
     73     return inner_f
     74 

c:\users\donald seger\miniconda3\envs\tensorflow\lib\site-packages\sklearn\utils\validation.py in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, estimator)
    810         y = y.astype(np.float64)
    811 
--> 812     check_consistent_length(X, y)
    813 
    814     return X, y

c:\users\donald seger\miniconda3\envs\tensorflow\lib\site-packages\sklearn\utils\validation.py in check_consistent_length(*arrays)
    253     uniques = np.unique(lengths)
    254     if len(uniques) > 1:
--> 255         raise ValueError("Found input variables with inconsistent numbers of"
    256                          " samples: %r" % [int(l) for l in lengths])
    257 

ValueError: Found input variables with inconsistent numbers of samples: [2, 36]

Upvotes: 0

Views: 101

Answers (1)

tlentali

Reputation: 3455

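Indeed, the rows and columns are swapped, as @BenReiniger noticed in a comment. scikit-learn expects X to have shape (n_samples, n_features), but np.array([x1, x2]) has shape (2, 36): 2 rows against the 36 values in y, which is exactly what the error message reports.
I tested your code, and transposing X with .T solves the problem: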

>>> import numpy as np
>>> import sklearn.linear_model

>>> x1 = [790, 1160,  929,  865, 1140,  929, 1109, 1365, 1112, 1150,  980,
...       990, 1112, 1252, 1326, 1330, 1365, 1280, 1119, 1328, 1584, 1428,
...       1365, 1415, 1415, 1465, 1490, 1725, 1523, 1705, 1605, 1746, 1235,
...       1390, 1405, 1395]
>>> x2 = [1000, 1200, 1000,  900, 1500, 1000, 1400, 1500, 1500, 1600, 1100,
...       1300, 1000, 1600, 1600, 1600, 1600, 2200, 1600, 2000, 1600, 2000,
...       2100, 1600, 2000, 1500, 2000, 2000, 1600, 2000, 2100, 2000, 1600,
...       1600, 1600, 2500]
>>> X = np.array([x1, x2]).T
>>> y = np.array([99,  95,  95,  90, 105, 105,  90,  92,  98,  99,  99, 101,  99,
...               94,  97,  97,  99, 104, 104, 105,  94,  99,  99,  99,  99, 102,
...               104, 114, 109, 114, 115, 117, 104, 108, 109, 120])
>>> regr = sklearn.linear_model.LinearRegression()
>>> regr.fit(X, y)
LinearRegression()

fit returns the fitted LinearRegression() object, as expected.
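For reference, here is a minimal sketch of the same fix that also prints the array shapes, so the [2, 36] in the error message becomes visible. It uses np.column_stack as an alternative to .T (my own choice, not something from the question), and the names X_wrong and predicted are only illustrative. The predict call reuses the input from the traceback in the question.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

x1 = [790, 1160, 929, 865, 1140, 929, 1109, 1365, 1112, 1150, 980,
      990, 1112, 1252, 1326, 1330, 1365, 1280, 1119, 1328, 1584, 1428,
      1365, 1415, 1415, 1465, 1490, 1725, 1523, 1705, 1605, 1746, 1235,
      1390, 1405, 1395]
x2 = [1000, 1200, 1000, 900, 1500, 1000, 1400, 1500, 1500, 1600, 1100,
      1300, 1000, 1600, 1600, 1600, 1600, 2200, 1600, 2000, 1600, 2000,
      2100, 1600, 2000, 1500, 2000, 2000, 1600, 2000, 2100, 2000, 1600,
      1600, 1600, 2500]
y = np.array([99, 95, 95, 90, 105, 105, 90, 92, 98, 99, 99, 101, 99,
              94, 97, 97, 99, 104, 104, 105, 94, 99, 99, 99, 99, 102,
              104, 114, 109, 114, 115, 117, 104, 108, 109, 120])

# Stacking the lists as rows gives one row per variable, not per sample.
X_wrong = np.array([x1, x2])
print(X_wrong.shape)  # (2, 36)

# Stacking them as columns gives the (n_samples, n_features) layout.
X = np.column_stack([x1, x2])
print(X.shape)  # (36, 2)

regr = LinearRegression().fit(X, y)

# predict also expects a 2-D input: one row per sample, one column per feature.
predicted = regr.predict([[3300, 1300]])
print(predicted)
```

np.column_stack([x1, x2]) and np.array([x1, x2]).T produce the same array here, so which one to use is a matter of taste.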

Upvotes: 1
