Reputation: 814
I'd like to add weights to my training data based on its recency.
If we look at a simple example:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
X = np.array([1,2,3,4,5,6,7,8,9,10]).reshape(-1,1)
Y = np.array([0.25, 0.5, 0.75, 1, 1.5, 2, 3, 4, 6, 10]).reshape(-1,1)
poly_reg = PolynomialFeatures(degree=2)
X_poly = poly_reg.fit_transform(X)
pol_reg = LinearRegression()
pol_reg.fit(X_poly, Y)
plt.scatter(X, Y, color='red')
plt.plot(X, pol_reg.predict(poly_reg.fit_transform(X)), color='blue')
Now imagine that the X values are time-based and the Y values are snapshots of a sensor, so we're modeling some behavior over time. I believe the newest data points are the most important, as they are most indicative of future behavior, and I'd like to adjust my model so that the newest data points are weighted the highest.
There is a question about doing this in R: https://stats.stackexchange.com/questions/196653/assigning-more-weight-to-more-recent-observations-in-regression
I'm wondering if the sklearn package (or any other python packages) has this feature?
This weighted model would have a similar curve but would fit the newer points better. If I want to use this model to predict the future, the non-weighted models will always be too conservative in their prediction as they won't be as sensitive to the newest data.
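For reference, here's the direction I'm imagining (I haven't verified this is the right approach): scikit-learn's LinearRegression.fit accepts a sample_weight argument, so recency weights could presumably be passed there. The exponential decay rate of 0.5 below is an arbitrary choice of mine and would need tuning:
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]).reshape(-1, 1)
Y = np.array([0.25, 0.5, 0.75, 1, 1.5, 2, 3, 4, 6, 10])
# Hypothetical recency weights: larger X (more recent) gets a larger weight
weights = np.exp(0.5 * X.ravel())
weights /= weights.sum()
X_poly = PolynomialFeatures(degree=2).fit_transform(X)
weighted_reg = LinearRegression()
# sample_weight must be a 1-D array of length n_samples
weighted_reg.fit(X_poly, Y, sample_weight=weights)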
Other than using this approach I've also used curve_fit to use a power function or exponential function:
from scipy.optimize import curve_fit
def func(x, a, b):
    return a * (x ** b)
# Use arrays rather than lists so func(X, *popt) broadcasts correctly below
X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
Y = np.array([0.25, 0.5, 0.75, 1, 1.5, 2, 3, 4, 6, 10])
popt, pcov = curve_fit(func, X, Y, bounds=([-np.inf,1], [np.inf, np.inf]))
plt.plot(X, func(X, *popt), color = 'green')
If a solution using func and curve_fit is possible, I'm open to that too, or to any other methods. The only caveat is that my real-world data doesn't always imply a monotonically increasing function, but my ideal solution would be one.
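One half-formed idea for combining the two: curve_fit takes a sigma argument, and since it minimizes sum(((f(x) - y) / sigma)**2), a recency weight w should translate to sigma = 1/sqrt(w). A sketch (the weight schedule is again an arbitrary assumption of mine):
import numpy as np
from scipy.optimize import curve_fit
def func(x, a, b):
    return a * (x ** b)
X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
Y = np.array([0.25, 0.5, 0.75, 1, 1.5, 2, 3, 4, 6, 10])
# Recency weights -> per-point sigma: curve_fit minimizes
# sum(((func(x) - y) / sigma) ** 2), so sigma = 1 / sqrt(weight)
w = np.exp(0.5 * X)
sigma = 1.0 / np.sqrt(w)
popt, pcov = curve_fit(func, X, Y, sigma=sigma,
                       bounds=([-np.inf, 1], [np.inf, np.inf]))
With b constrained to be at least 1, the fitted power function stays monotonic (increasing when a > 0), which matches my ideal solution.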
Upvotes: 3
Views: 6014
Reputation: 959
As implemented from scratch:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
#%matplotlib inline
X = np.array([1,2,3,4,5,6,7,8,9,10]).reshape(-1,1)
# Exponential recency weights, normalized so they sum to 1
w = np.exp(X) / np.exp(X).sum()
Y = np.array([0.25, 0.5, 0.75, 1, 1.5, 2, 3, 4, 6, 10]).reshape(-1,1)
poly_reg = PolynomialFeatures(degree=2)
# Vandermonde matrix: columns [1, x, x^2] for each sample
X_poly = poly_reg.fit_transform(X)
# Solve the weighted normal equations: beta = (X^T W X)^{-1} X^T W y
# (np.linalg.solve would be more numerically stable than an explicit inverse)
A = np.linalg.inv(X_poly.T @ (w * X_poly))
beta = (A @ X_poly.T) @ (w * Y)
# Evaluate the polynomial at x given its coefficients
# (np.polyval could also be used, but expects coefficients in the opposite order)
def polynomial(x, coeff):
    y = 0
    for p, c in enumerate(coeff):
        y += c * x ** p
    return y
plt.scatter(X, Y, color='red')
plt.plot(X, polynomial(X, beta), color='blue')
#Source https://en.wikipedia.org/wiki/Weighted_least_squares#Introduction
Note that this does the same as Teo's answer, which is shorter.
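As a quick sanity check (a sketch, assuming Teo's answer used LinearRegression with sample_weight), the two approaches should produce matching coefficients:
from sklearn.linear_model import LinearRegression
# fit_intercept=False because X_poly already contains the bias column
check = LinearRegression(fit_intercept=False)
check.fit(X_poly, Y, sample_weight=w.ravel())
print(beta.ravel())         # coefficients from the normal equations
print(check.coef_.ravel())  # should match up to floating-point error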
Upvotes: 2