Reputation: 181
I have a large dataset of 23k rows. The data looks something like the following:
import pandas as pd
d = {'Date': ["1-1-2020", "1-1-2020", "1-2-2020", "1-2-2020"],
     'Stock_id': [5, 41, 5, 41],
     'last_price': [230, 8, 241, 9],
     'price': [241, 9, 240, 8.5]}
df = pd.DataFrame(data=d)
Date Stock_id last_price price
0 1-1-2020 5 230 241.0
1 1-1-2020 41 8 9.0
2 1-2-2020 5 241 240.0
3 1-2-2020 41 9 8.5
Note that the data includes many stocks on many different dates. How can I create a model that uses features such as last_price and Stock_id to predict the next-day price, and that re-trains on the old data as new data comes in?
This was the best I could come up with. I used LinearRegression, but suggestions for any other model are welcome.
X = df[['Stock_id', 'last_price']]
y = df['price']
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn import linear_model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
lm = linear_model.LinearRegression()
lm.fit(X_train, y_train)
y_pred = lm.predict(X_test)
result = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
Index Actual Predicted
487 45 32
4154 420 512
Is there a way to train the model on the first 3000 rows, have it predict prices for, say, 12-11-2020, then add the 12-11-2020 data to the training set before predicting 12-12-2020, and so on?
I was hoping to get something like this.
Date Actual Predicted
12-11-2020 45 32
12-11-2020 420 512
12-12-2020 43 34
12-12-2020 423 513
Upvotes: 0
Views: 1067
Reputation: 11
I don't think having the ID in your training set is appropriate, since comparing IDs carries no usable information and may lead to a badly fitted linear function for your model. An ID just names a specific stock and stays constant for that stock across the whole dataset. The numeric value of Stock_id also has no meaning for comparing stocks: Stock_id = 1 and Stock_id = 2 are not "closer together" than Stock_id = 1 and Stock_id = 100; they are just labels. So I think you should split your original dataset by Stock_id and include only last_price in each of the new training datasets (X). You can do that in several ways, one of them being pandas' groupby function:
grouped = df.groupby(df.Stock_id)
stock_5 = grouped.get_group(5)  # all rows for Stock_id 5
After that, you can loop over the unique values of your Stock_id column to get each ID and its DataFrame. Then you define a regression model for each of these new datasets and train it with the fit method, along these lines:
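A minimal sketch of that loop, using the df from the question (the models dict and variable names are just illustrative):

from sklearn.linear_model import LinearRegression

# One model per stock, trained only on last_price, keyed by Stock_id.
models = {}
for stock_id, group in df.groupby("Stock_id"):
    model = LinearRegression()
    model.fit(group[["last_price"]], group["price"])
    models[stock_id] = model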
To retrain or update your regression model: LinearRegression does not support partial fitting, so I think you need to call the fit method again each time you want to update the model. You can fit on the first N rows of each stock, predict the next last_price, add the new value to those N rows, and re-fit the model on the extended dataset; a sketch follows below. However, if your model already fits the data well, I don't think you will see much of a difference from adding new points to the training dataset.
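A rough sketch of that expanding-window loop, assuming stock_df holds one stock's rows sorted by Date and that each day's actual row is appended before the next prediction:

from sklearn.linear_model import LinearRegression

def walk_forward(stock_df, n_train):
    # Fit on all rows seen so far, predict the next day, then grow the
    # training window by that day's row and repeat.
    model = LinearRegression()
    preds = []
    for i in range(n_train, len(stock_df)):
        train = stock_df.iloc[:i]
        model.fit(train[["last_price"]], train["price"])
        preds.append(model.predict(stock_df.iloc[[i]][["last_price"]])[0])
    return preds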
Another option is to use SGDRegressor instead of LinearRegression, since it has a partial_fit() method that allows incremental training: you can update your model on new data without re-training it on the whole dataset. You can find the documentation for this model here, and this answer explains the difference between SGDRegressor and LinearRegression.
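A minimal sketch of such an incremental update (the two-batch split is just for illustration; SGD is sensitive to feature scale, so the inputs are standardized first):

from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
sgd = SGDRegressor(random_state=0)
# Initial fit on the data available so far.
X_init = df[["last_price"]].iloc[:2]
sgd.partial_fit(scaler.fit_transform(X_init), df["price"].iloc[:2])
# Later, update on the new rows without re-training on the old ones.
X_new = df[["last_price"]].iloc[2:]
sgd.partial_fit(scaler.transform(X_new), df["price"].iloc[2:])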
If you still want to use LinearRegression and retrain the model, I suggest updating it with batches of data instead of retraining on each new predicted value. You can wait until your new values reach a certain count, for example 10, then add those 10 rows to your training dataset and retrain the model just once. This answer explains three approaches to retraining a model which might be useful for you.
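For example, a batched-retrain sketch along those lines (BATCH_SIZE and the add_row helper are illustrative, not from any library):

import pandas as pd
from sklearn.linear_model import LinearRegression

BATCH_SIZE = 10
model = LinearRegression()
history = df.copy()   # rows the model has already been trained on
pending = []          # new rows waiting for the next retrain
model.fit(history[["last_price"]], history["price"])

def add_row(row):
    # Queue one new observed row (a dict); retrain only once a full batch arrives.
    global history
    pending.append(row)
    if len(pending) >= BATCH_SIZE:
        history = pd.concat([history, pd.DataFrame(pending)], ignore_index=True)
        pending.clear()
        model.fit(history[["last_price"]], history["price"])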
Upvotes: 1