Evan

Reputation: 383

How can you improve a machine learning model as more data becomes available?

The basic process for most supervised machine learning problems is to divide the dataset into a training set and test set and then train a model on the training set and evaluate its performance on the test set. But in many (most) settings, disease diagnosis for example, more data will be available in the future. How can I use this to improve upon the model? Do I need to retrain from scratch? When might be the appropriate time to retrain if this is the case (e.g., a specific percent of additional data points)?

Upvotes: 0

Views: 752

Answers (1)

ASH

Reputation: 20302

Let’s take the example of predicting house prices. House prices change all the time, so a model trained on data from six months ago could give terrible predictions today. For house prices, it’s imperative that you train your models on up-to-date information.

When designing a machine learning system, it is important to understand how your data will change over time. A well-architected system takes this into account and includes a plan for keeping its models updated.

Manual retraining

One way to keep models fresh is to retrain and redeploy them using the same process you used to build them in the first place. As you can imagine, this can be time-consuming. How often do you retrain your models? Weekly? Daily? There is a balance between cost and benefit. Costs in model retraining include:

  1. Computational Costs
  2. Labor Costs
  3. Implementation Costs

On the other hand, as you are manually retraining your models you may discover a new algorithm or a different set of features that provide improved accuracy.

Continuous learning

Another way to keep your models up-to-date is to have an automated system to continuously evaluate and retrain your models. This type of system is often referred to as continuous learning, and may look something like this:

  1. Save new training data as you receive it. For example, if you are receiving updated prices of houses on the market, save that information to a database.
  2. When you have enough new data, test your current model's accuracy against it.
  3. If you see the accuracy of your model degrading over time, use the new data, or a combination of the new data and old training data, to build and deploy a new model.

The benefit of a continuous learning system is that it can be completely automated.
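
The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a production design: the model is a one-feature least-squares line (price from size), and the names ContinuousLearner, batch_size, max_error and window are hypothetical choices for this example. It uses only the standard library.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b with a single feature.

    Assumes the x values are not all identical.
    """
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx


def mean_abs_error(model, xs, ys):
    """Average absolute difference between predictions and true values."""
    a, b = model
    return mean(abs(a * x + b - y) for x, y in zip(xs, ys))


class ContinuousLearner:
    """Hypothetical helper: buffer new data, evaluate, retrain if degraded."""

    def __init__(self, xs, ys, batch_size=3, max_error=20.0, window=None):
        self.history = list(zip(xs, ys))   # all labeled data seen so far
        self.pending = []                  # new data not yet evaluated
        self.batch_size = batch_size       # "enough new data" (assumption)
        self.max_error = max_error         # acceptable error (assumption)
        self.window = window               # retrain on last N points, or all
        self.model = fit_line(xs, ys)

    def predict(self, x):
        a, b = self.model
        return a * x + b

    def add(self, x, y):
        """Record one new labeled point; return True if the model was retrained."""
        # Step 1: save new training data as it arrives.
        self.pending.append((x, y))
        if len(self.pending) < self.batch_size:
            return False

        # Step 2: enough new data, so measure the current model's error on it.
        new_xs, new_ys = zip(*self.pending)
        error = mean_abs_error(self.model, new_xs, new_ys)
        self.history.extend(self.pending)
        self.pending.clear()
        if error <= self.max_error:
            return False

        # Step 3: accuracy has degraded, so rebuild the model from recent data
        # (or from old and new data combined when window is None).
        recent = self.history[-self.window:] if self.window else self.history
        self.model = fit_line(*zip(*recent))
        return True


# Usage: train on old prices, then feed in newer, higher prices.
learner = ContinuousLearner([50, 80, 100, 120], [110, 170, 210, 250], window=3)
for size, price in [(60, 190), (90, 280), (110, 340)]:
    retrained = learner.add(size, price)
print("retrained:", retrained, "prediction for size 100:", learner.predict(100))
```

Whether to retrain on only the recent window or on old and new data together depends on how quickly the underlying relationship changes; for something like house prices, older data may hurt more than it helps.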

Upvotes: 1
