How to perform multivariable linear regression with scikit-learn?

Question

Forgive my terminology, I'm not an ML pro. I might use the wrong terms below.

I'm trying to perform multivariable linear regression. Let's say I'm trying to work out user gender by analysing page views on a web site.

For each user whose gender I know, I have a feature matrix where each row represents a web site section: the first element is the section number and the second element indicates whether they visited it, e.g.:

male1 = [
    [1, 1],     # visited section 1
    [2, 0],     # didn't visit section 2
    [3, 1],     # visited section 3, etc
    [4, 0]
]

So in scikit, I am building xs and ys. I'm representing a male as 1, and female as 0.

The above would be represented as:

features = male1
gender = 1

Now, I'm obviously not just training a model for a single user, but instead I have tens of thousands of users whose data I'm using for training.

I would have thought I should create my xs and ys as follows:

xs = [
    [          # user1
       [1, 1],    
       [2, 0],     
       [3, 1],    
       [4, 0]
    ],
    [          # user2
       [1, 0],    
       [2, 1],     
       [3, 1],    
       [4, 0]
    ],
    ...
]

ys = [1, 0, ...]

scikit doesn't like this:

from sklearn import linear_model

clf = linear_model.LinearRegression()
clf.fit(xs, ys)

It complains:

ValueError: Found array with dim 3. Estimator expected <= 2.

How am I supposed to supply a feature matrix to the linear regression algorithm in scikit-learn?

Tonechas · Accepted Answer

You need to create xs in a different way. According to the docs:

fit(X, y, sample_weight=None)

Parameters:

    X : numpy array or sparse matrix of shape [n_samples, n_features]
        Training data
    y : numpy array of shape [n_samples, n_targets]
        Target values
    sample_weight : numpy array of shape [n_samples]
        Individual weights for each sample

Hence xs should be a 2D array with as many rows as users and as many columns as web site sections. You defined xs as a 3D array though. To reduce the number of dimensions by one, you could drop the section numbers with a list comprehension. The column position then identifies the section, so this assumes every user lists the same sections in the same order:

xs = [[visit for section, visit in user] for user in xs]

If you do so, the data you provided as an example gets transformed into:

xs = [[1, 0, 1, 0], # user1
      [0, 1, 1, 0], # user2
      ...
      ]

and clf.fit(xs, ys) should work as expected.
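
As a quick sanity check, the transformation can be tried on the two example users from the question. This is only a minimal sketch: the variable name users and the shape check at the end are illustrative additions, not part of the original code.

```python
# Two example users in the 3D layout from the question:
# each inner pair is [section_number, visited].
users = [
    [[1, 1], [2, 0], [3, 1], [4, 0]],  # user1
    [[1, 0], [2, 1], [3, 1], [4, 0]],  # user2
]
ys = [1, 0]  # 1 = male, 0 = female, as in the question

# Keep only the "visited" flag; the column position now encodes the section.
# Assumption: every user lists the same sections in the same order.
xs = [[visit for section, visit in user] for user in users]

n_samples = len(xs)
n_features = len(xs[0])

print(xs)                       # [[1, 0, 1, 0], [0, 1, 1, 0]]
print((n_samples, n_features))  # (2, 4)
```

The result has one row per user and one column per section, which is the [n_samples, n_features] layout that fit expects.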

A more efficient way to drop the extra dimension is to slice a NumPy array, keeping only the second element (the visit flag) of each pair:

import numpy as np

# Index 1 on the last axis keeps the visit flag and drops the section number
xs = np.asarray(xs)[:, :, 1]
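
Putting the pieces together, here is a small end-to-end sketch of the slicing approach followed by the fit. The two users are the ones from the question; the names raw and X are illustrative, and two samples are obviously far too few for a meaningful model, so this only demonstrates that the shapes line up.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Two example users in the question's 3D layout: [section_number, visited].
raw = np.asarray([
    [[1, 1], [2, 0], [3, 1], [4, 0]],  # user1
    [[1, 0], [2, 1], [3, 1], [4, 0]],  # user2
])
ys = np.asarray([1, 0])  # 1 = male, 0 = female, as in the question

print(raw.shape)  # (2, 4, 2): users x sections x [section_number, visited]

# Index 1 on the last axis selects the "visited" flag for every user and section.
X = raw[:, :, 1]
print(X.shape)  # (2, 4): n_samples x n_features

clf = LinearRegression()
clf.fit(X, ys)  # a 2D X no longer raises the "dim 3" ValueError
print(clf.coef_.shape)  # (4,): one coefficient per section
```

Checking X.shape before calling fit is a cheap way to confirm that the rows are users and the columns are sections.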
