Olivia

Reputation: 814

weighted regression sklearn

I'd like to add weights to my training data based on its recency.

If we look at a simple example:

import matplotlib.pyplot as plt
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

X = np.array([1,2,3,4,5,6,7,8,9,10]).reshape(-1,1)
Y = np.array([0.25, 0.5, 0.75, 1, 1.5, 2, 3, 4, 6, 10]).reshape(-1,1)

poly_reg = PolynomialFeatures(degree=2)
X_poly = poly_reg.fit_transform(X)
pol_reg = LinearRegression()
pol_reg.fit(X_poly, Y)

plt.scatter(X, Y, color='red')
plt.plot(X, pol_reg.predict(poly_reg.fit_transform(X)), color='blue')

[plot: red scatter of the data with the blue degree-2 polynomial fit]

Now imagine that the X values are time-based and the Y values are snapshots of a sensor, so we're modeling some behavior over time. I believe the newest data points are the most important, since they are the most indicative of future behavior. I'd like to adjust my model so that the newest data points are weighted the highest.

There is a question about doing this in R: https://stats.stackexchange.com/questions/196653/assigning-more-weight-to-more-recent-observations-in-regression

I'm wondering if sklearn (or any other Python package) has this feature?

This weighted model would have a similar curve but would fit the newer points better. If I want to use the model to predict the future, a non-weighted model will always be too conservative in its predictions, since it isn't as sensitive to the newest data.
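
To make this concrete, here's roughly the kind of recency weighting I have in mind (the decay factor 0.9 is just a placeholder, not something derived from my data):

import numpy as np

n = 10                                 # number of samples, oldest to newest
decay = 0.9                            # placeholder decay factor
w = decay ** np.arange(n - 1, -1, -1)  # newest point gets weight 1
w = w / w.sum()                        # normalize so the weights sum to 1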

Besides the polynomial-regression approach above, I've also used curve_fit to fit a power or exponential function:

import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

# Power-law model: y = a * x^b
def func(x, a, b):
    return a * (x ** b)

# Arrays (rather than lists) so func can be evaluated on X when plotting
X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
Y = np.array([0.25, 0.5, 0.75, 1, 1.5, 2, 3, 4, 6, 10])

# Constrain the exponent b to be at least 1
popt, pcov = curve_fit(func, X, Y, bounds=([-np.inf, 1], [np.inf, np.inf]))
plt.plot(X, func(X, *popt), color='green')

If a solution using func and curve_fit is possible, I'm open to that too, or to any other method. The only caveat is that my real-world data doesn't always suggest a monotonically increasing function, but my ideal fitted model would be one.
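
For instance, if curve_fit's sigma argument can express the weighting (it divides the residuals, so a smaller sigma makes a point count more), then a sketch like this is what I'm after; the linearly increasing weights are placeholders:

import numpy as np
from scipy.optimize import curve_fit

def func(x, a, b):
    return a * (x ** b)

X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
Y = np.array([0.25, 0.5, 0.75, 1, 1.5, 2, 3, 4, 6, 10])

# Placeholder recency weights: newer points count more
w = np.linspace(0.1, 1.0, len(X))

# curve_fit minimizes sum(((y - f(x)) / sigma)**2), so sigma = 1/sqrt(w)
popt, pcov = curve_fit(func, X, Y, sigma=1.0 / np.sqrt(w),
                       bounds=([-np.inf, 1], [np.inf, np.inf]))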

Upvotes: 3

Views: 6014

Answers (2)

AlexNe

Reputation: 959

Here it is implemented from scratch:

import matplotlib.pyplot as plt
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

#%matplotlib inline

X = np.array([1,2,3,4,5,6,7,8,9,10]).reshape(-1,1)
# Exponential recency weights (heavily favor the newest points), normalized so w.sum() == 1
w = np.exp(X) / np.sum(np.exp(X))

Y = np.array([0.25, 0.5, 0.75, 1, 1.5, 2, 3, 4, 6, 10]).reshape(-1,1)

poly_reg = PolynomialFeatures(degree=2)
#Vandermonde Matrix
X_poly = poly_reg.fit_transform(X)

#Solve Weighted Normal Equation
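# i.e. beta = (X^T W X)^{-1} X^T W y, with W = diag(w)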
A = np.linalg.inv(X_poly.T @ (w*X_poly))
beta = (A @ X_poly.T) @ (w*Y)

# Evaluate the polynomial (NumPy broadcasting keeps this vectorized)
def polynomial(x, coeff):
    y = 0
    for p, c in enumerate(coeff):
        y += c * x**p
    return y

plt.scatter(X, Y, color='red')
plt.plot(X, polynomial(X, beta), color='blue')

# Source: https://en.wikipedia.org/wiki/Weighted_least_squares#Introduction

Note that this does the same as teoML's answer, which is shorter.

Upvotes: 2

teoML

Reputation: 836

I took a look at sklearn's LinearRegression API here and saw that the class's fit() method has the following signature: fit(self, X, y[, sample_weight]). So, as far as I understand, you can pass it a weight vector for your samples.
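
For example, something like this should work (an untested sketch; the linearly increasing weights are made up to favor the newest samples):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

X = np.arange(1, 11).reshape(-1, 1)
Y = np.array([0.25, 0.5, 0.75, 1, 1.5, 2, 3, 4, 6, 10])

X_poly = PolynomialFeatures(degree=2).fit_transform(X)

# Hypothetical recency weights: newest sample weighted the most
weights = np.linspace(1, 10, len(X))

model = LinearRegression()
model.fit(X_poly, Y, sample_weight=weights)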

Upvotes: 7
