Mark Lavin

Reputation: 1242

sklearn linear regression for 2D array

I have a Numpy 2D array in which the rows are individual time series and the columns correspond to the time points. I would like to fit a regression line to each of the rows to measure the trends of each time series, which I guess I could do (inefficiently) with a loop like:

array2D = ...
for row in array2D:
    coeffs = sklearn.linear_model.LinearRegression().fit( row[:, None], range( len( row ) ) ).coef_
    ...

Is there a way to do this without a loop? What is the resulting shape of coeffs?

Upvotes: 1

Views: 4854

Answers (2)

Bernard Agbemadon

Reputation: 109

For those like me who prefer the range as X and the time-series data as y:

(image: the closed-form least-squares solution, coeffs = (XᵀX)⁻¹ Xᵀ y)

def linear_fit(periods, timeseries):
    # just format and add one column initialized at 1
    X_mat=np.vstack((np.ones(len(periods)), periods)).T
    
    # cf formula : linear-regression-using-matrix-multiplication
    tmp = np.linalg.inv(X_mat.T.dot(X_mat)).dot(X_mat.T)
    return tmp.dot(timeseries.T)[1] # 1 => coef_, 0 => intercept_

X = np.arange(8) # the periods, or the common range for all time series

y = np.array([ # time series
    [0., 0., 0., 0., 0., 73.92, 0., 114.32],
    [0., 0., 0., 0., 0., 73.92, 0., 114.32],
    [0., 10., 20., 30., 40., 50., 60., 70.]
])

>>> linear_fit(X, y)
array([12.16666667, 12.16666667, 10.        ])

PS: this approach (linear regression via matrix multiplication) is a gold mine for large datasets.
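As a cross-check (not part of the original answer): `np.polyfit` accepts a 2-D `y` and fits one polynomial per column, so the same slopes can be obtained in a single call with the series stacked as columns.

```python
import numpy as np

X = np.arange(8)  # the common periods
y = np.array([
    [0., 0., 0., 0., 0., 73.92, 0., 114.32],
    [0., 0., 0., 0., 0., 73.92, 0., 114.32],
    [0., 10., 20., 30., 40., 50., 60., 70.]
])

# Degree-1 fit per column of y.T; row 0 of the result holds the slopes
# (polyfit returns coefficients highest degree first).
slopes = np.polyfit(X, y.T, 1)[0]
```

This should match the `linear_fit` output above.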

Upvotes: 2

user2653663

Reputation: 2948

The coefficients that minimize the linear regression error are

(image: the normal equation, coeffs = (XᵀX)⁻¹ Xᵀ y)

You can solve for all rows in one go using numpy.

import numpy as np
from sklearn.linear_model import LinearRegression

def solve(timeseries):

    n_samples = timeseries.shape[1]
    # slope and offset/bias
    n_features = 2
    n_series = timeseries.shape[0]

    # For a single time series, X would be of shape
    # (n_samples, n_features) however in this case
    # it will be (n_samples, n_features, n_series).
    # The bias is added by having features being all 1's
    X = np.ones((n_samples, n_features, n_series))
    X[:, 1, :] = timeseries.T

    y = np.arange(n_samples)

    # A is the matrix to be inverted and will
    # be of shape (n_series, n_features, n_features)
    A = X.T @ X.transpose(2, 0, 1)
    A_inv = np.linalg.inv(A) 

    # Do the other multiplications step by step
    B = A_inv @ X.T
    C = B @ y 

    # Return only the slopes (which is what .coef_ does in sklearn)
    return C[:,1]

array2D = np.random.random((3,10))
coeffs_loop = np.empty(array2D.shape[0])
for i, row in enumerate(array2D):
    coeffs = LinearRegression().fit( row[:,None], range( len( row ) )).coef_
    coeffs_loop[i] = coeffs

coeffs_vectorized = solve(array2D)

print(np.allclose(coeffs_loop, coeffs_vectorized))
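A variant sketch of the same batched normal-equation solve (my rewording, same orientation as the question: X = series values, y = time index) that uses `np.linalg.solve`, which broadcasts over the leading axis, instead of forming the explicit inverse:

```python
import numpy as np

def solve_batched(timeseries):
    # Stack one design matrix per series: (n_series, n_samples, 2),
    # column 0 is the intercept column of ones, column 1 the values.
    n_series, n_samples = timeseries.shape
    X = np.ones((n_series, n_samples, 2))
    X[:, :, 1] = timeseries
    y = np.arange(n_samples, dtype=float)

    Xt = X.transpose(0, 2, 1)                 # (n_series, 2, n_samples)
    A = Xt @ X                                # (n_series, 2, 2) normal matrices
    b = Xt @ y[:, None]                       # (n_series, 2, 1)
    coeffs = np.linalg.solve(A, b)[:, :, 0]   # (n_series, 2)
    return coeffs[:, 1]                       # slopes only, shape (n_series,)
```

So to answer the shape question: the result is one slope per row, i.e. shape `(n_series,)`.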

Upvotes: 3
