Reputation: 383
The basic process for most supervised machine learning problems is to divide the dataset into a training set and a test set, train a model on the training set, and evaluate its performance on the test set. But in many (most) settings, disease diagnosis for example, more data will become available in the future. How can I use this new data to improve the model? Do I need to retrain from scratch? If so, when is the appropriate time to retrain (e.g., after a specific percentage of additional data points)?
Upvotes: 0
Views: 752
Reputation: 20302
Let’s take the example of predicting house prices. House prices change all the time. A machine learning model trained on house price data from six months ago could give terrible predictions today. For house prices, it’s imperative that you train your models on up-to-date information.
When designing a machine learning system it is important to understand how your data is going to change over time. A well-architected system should take this into account, and a plan should be put in place for keeping your models updated.
Manual retraining
One way to keep models fresh is to retrain and redeploy them using the same process you used to build them in the first place. As you can imagine, this process can be time-consuming. How often do you retrain your models? Weekly? Daily? There is a balance between cost and benefit. Costs in model retraining include:
On the other hand, as you are manually retraining your models you may discover a new algorithm or a different set of features that provide improved accuracy.
Continuous learning
Another way to keep your models up-to-date is to have an automated system to continuously evaluate and retrain your models. This type of system is often referred to as continuous learning, and may look something like this:
The benefit of a continuous learning system is that it can be completely automated.
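To make the evaluate-and-retrain idea concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration, not a standard recipe: the one-feature linear model, the helper names (`fit_line`, `mean_abs_error`, `continuous_learning_step`), the `tolerance` factor of 1.5, and the 80/20 split of the fresh batch are all hypothetical choices. The idea is simply to score the deployed model on newly labelled data and retrain only when its error has grown well beyond the error it had when it was deployed.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one feature)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def mean_abs_error(model, xs, ys):
    """Mean absolute error of a (slope, intercept) model on the given data."""
    slope, intercept = model
    return mean(abs(slope * x + intercept - y) for x, y in zip(xs, ys))


def continuous_learning_step(model, baseline_mae, xs, ys, tolerance=1.5):
    """Evaluate the deployed model on a fresh labelled batch; retrain if it degraded.

    Returns (model, baseline_mae, retrained). The threshold rule
    (error > tolerance * baseline) is an illustrative assumption.
    """
    current_mae = mean_abs_error(model, xs, ys)
    if current_mae <= tolerance * baseline_mae:
        return model, baseline_mae, False  # still good enough, keep serving it

    # Retrain on the first 80% of the fresh batch and re-baseline on the rest,
    # so the new baseline is measured on data the new model has not seen.
    cut = int(len(xs) * 0.8)
    new_model = fit_line(xs[:cut], ys[:cut])
    return new_model, mean_abs_error(new_model, xs[cut:], ys[cut:]), True


# Hypothetical usage: house size (x) against price (y), with prices drifting upward.
sizes = list(range(1, 11))
noise = [1 if s % 2 else -1 for s in sizes]
old_prices = [100 * s + 50 + n for s, n in zip(sizes, noise)]
new_prices = [130 * s + 50 + n for s, n in zip(sizes, noise)]

model = fit_line(sizes, old_prices)
baseline = mean_abs_error(model, sizes, old_prices)
model, baseline, retrained = continuous_learning_step(model, baseline, sizes, new_prices)
```

In a real system this step would run on a schedule or whenever enough new labelled data arrives, and the retrained model would usually be validated before replacing the one in production. This sketch also discards the old data on retraining, which suits drifting targets like house prices; for other problems, training on old and new data combined may be the better choice.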
Upvotes: 1