Reputation: 197
I'm looking to use a feature store to optimize feature reuse across many different models.
Example: I have 10 different models that use the same 2 feature sets (i.e., 2 datasets of features without labels). The main difference is that each model predicts a different set of labels.
I could not find any well-known pattern on the web, so I've come up with 3 different strategies, none of which fully convinces me.
Less reusable but simple solution: For each feature set, “replicate” it and create one group per model, with that model's dedicated labels. With 2 feature sets and 10 models, we would have 20 different groups that share the same features and differ only in the labels.
More reusable but complex solution (a): Create only 2 feature groups, but include the labels for all the models. Then, when creating the dataset, select from the group only the label column for the model being trained. With 2 feature sets and 10 models, you would have only 2 groups, each with 10 extra columns, one per label.
More reusable but complex solution (b): Create 2 feature groups, plus one feature group for each label set. Then, when creating the dataset, join the “shared” feature group with the one that contains the label column for the model being trained. With 2 feature sets and 10 models, you would have 12 groups: the 2 “shared” ones plus 10, each corresponding to a label set.
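To make the difference between (a) and (b) concrete, here is a minimal sketch of how the training dataset would be assembled in each case. It uses plain pandas DataFrames as stand-ins for feature groups and is not the SageMaker Feature Store API; the column names (`record_id`, `feature_a`, `label_model_1`, ...) and both helper functions are hypothetical, and event-time handling is left out for brevity.

```python
import pandas as pd

# Hypothetical shared feature group (one of the 2 feature sets).
shared_features = pd.DataFrame(
    {"record_id": [1, 2, 3], "feature_a": [0.1, 0.2, 0.3], "feature_b": [10, 20, 30]}
)

# Hypothetical labels for two of the 10 models; model 2 has no label for record 3.
labels_model_1 = pd.DataFrame({"record_id": [1, 2, 3], "label_model_1": [0, 1, 0]})
labels_model_2 = pd.DataFrame({"record_id": [1, 2], "label_model_2": [1, 1]})

# Option (a): one wide group holding the features and every model's label column.
wide_group = shared_features.merge(labels_model_1, on="record_id", how="left").merge(
    labels_model_2, on="record_id", how="left"
)


def dataset_from_wide_group(group: pd.DataFrame, label_column: str) -> pd.DataFrame:
    """Option (a): keep the non-label columns plus the one requested label."""
    feature_columns = [c for c in group.columns if not c.startswith("label_")]
    # Rows without a label for this model show up as nulls and must be dropped.
    return group[feature_columns + [label_column]].dropna(subset=[label_column])


def dataset_from_separate_groups(features: pd.DataFrame, labels: pd.DataFrame) -> pd.DataFrame:
    """Option (b): join the shared features with one model's label group."""
    return features.merge(labels, on="record_id", how="inner", validate="one_to_one")


train_a = dataset_from_wide_group(wide_group, "label_model_2")
train_b = dataset_from_separate_groups(shared_features, labels_model_2)
```

Both calls select the same rows and columns. The practical difference in this sketch is where the complexity sits: in (a) every model's labels live in the same table, so missing labels appear as nulls that each consumer has to filter out, while in (b) each label set is its own table and the join decides which records are kept.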
I would be keen to use the second solution, but I'm not experienced enough to understand the potential risks (versioning, lineage, maintainability, etc.).
What do you think? Would you suggest a different approach?
For reference, I'm working on AWS, using SageMaker Feature Store.
Upvotes: 0
Views: 171
Reputation: 931
First, I don't understand why you have "10 different models that use the same 2 feature sets" and not just 1 dataset.
In my opinion, it is a trade-off between reusability, maintainability, and complexity. I think the main benefit of using a feature store is sharing features between models/business cases. If the features are required to stay the same for all models, I would go for 2b. If instead it's a coincidence that the features are all the same right now, and each model may evolve independently in the future, I would go for option 1, since it lets you update the features for one model without the risk of breaking other existing models.
I would also consider whether duplicating the replicated features would incur unnecessary costs or degrade performance; I'm not familiar with AWS feature store.
Upvotes: 0