Sergio

Reputation: 197

Feature Store: Patterns for reusing the same features across different models

I'm looking to use a feature store to optimize feature reuse across many different models.

Example: I have 10 different models that use the same 2 feature sets (e.g., 2 datasets of features without labels). The main difference is that each model predicts a different set of labels.

I could not find any well-known pattern on the web, so I've come up with 3 different strategies, none of which fully convinces me.

Less Reusable but simple solution: Given each feature set, “replicate” it and create one group for each model, with its dedicated labels. With 2 feature sets and 10 models, we would have 20 different groups that share the same features, except for the labels.

More reusable but complex solution (a): Create only 2 feature groups, but include the labels for all the models. Then, when creating the dataset, filter the group to retrieve only the label column for the specific model trained. With 2 feature sets and 10 models, you would only have 2 groups, each one of them with 10 extra columns, one for each label.

More reusable but complex solution (b): Create 2 feature groups, plus a feature group for each label set. Then, when creating the dataset, select the “shared” feature group and the one that contains the label column for the specific model trained. With 2 feature sets and 10 models, you would have 12 groups; the 2 “shared” ones plus 10, each one of them corresponding to a label set.
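To make options (a) and (b) concrete, here is a minimal in-memory sketch. It does not use the SageMaker Feature Store API: plain dictionaries stand in for feature groups, and every name (`select_columns`, `join_groups`, `label_model_2`, and so on) is hypothetical. It only illustrates that both options can yield the same training set for a given model, and how the group counts compare.

```python
from typing import Dict, List

# In-memory stand-in for a feature group: record id -> {column: value}.
# This is an illustration only, not the SageMaker Feature Store API.
Group = Dict[str, Dict[str, float]]


def select_columns(group: Group, columns: List[str]) -> Group:
    """Keep only the requested columns for every record."""
    return {rid: {c: row[c] for c in columns} for rid, row in group.items()}


def join_groups(left: Group, right: Group) -> Group:
    """Inner join two groups on the record id."""
    return {rid: {**left[rid], **right[rid]} for rid in left if rid in right}


FEATURES = ["f1", "f2"]

# Option (a): one wide group holding the features and every model's label.
wide_group: Group = {
    "r1": {"f1": 0.1, "f2": 1.0, "label_model_1": 0.0, "label_model_2": 1.0},
    "r2": {"f1": 0.2, "f2": 2.0, "label_model_1": 1.0, "label_model_2": 0.0},
}
# Training set for model 2: filter down to the features plus its label column.
train_a = select_columns(wide_group, FEATURES + ["label_model_2"])

# Option (b): a shared, label-free group plus one label group per model.
shared_group: Group = {
    "r1": {"f1": 0.1, "f2": 1.0},
    "r2": {"f1": 0.2, "f2": 2.0},
}
label_group_model_2: Group = {
    "r1": {"label_model_2": 1.0},
    "r2": {"label_model_2": 0.0},
}
# Training set for model 2: join the shared features with its label group.
train_b = join_groups(shared_group, label_group_model_2)

# Number of groups per strategy for 2 feature sets and 10 models.
n_feature_sets, n_models = 2, 10
groups_option_1 = n_feature_sets * n_models  # one copy per (feature set, model)
groups_option_a = n_feature_sets             # labels live as extra columns
groups_option_b = n_feature_sets + n_models  # shared groups + label groups
```

In this sketch the difference is where the labels live: in (a), adding an 11th model means adding a column to each shared group, while in (b) it means adding a new label group and leaving the shared ones untouched.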

I would be keen to use the second solution, but I'm not experienced enough to understand the potential risks (versioning, lineage, maintainability, etc.).

What do you think? Would you suggest a different approach?

For reference, I'm working on AWS, using SageMaker Feature Store.

Upvotes: 0

Views: 171

Answers (1)

Davide Anghileri

Reputation: 931

First, I don't understand why you have "10 different models that use the same 2 feature sets" rather than just 1 dataset.

In my opinion, it's a trade-off between reusability, maintainability, and complexity. I think the main benefit of using a feature store is sharing features between models/business cases. If the features are all the same by constraint, I would go for option 2b. If instead it's a coincidence that the features are all the same now, and each model may evolve independently in the future, I would go for option 1, since it lets you update the features for one model without the risk of breaking other existing models.

I would also check whether duplicating the features would incur unnecessary costs or degrade performance; I'm not familiar with the AWS feature store.

Upvotes: 0
