ryfeus

Reputation: 363

Which techniques are used by SageMaker Neo for model optimizations

Does SageMaker Neo (SageMaker compilation job) use any techniques for model optimization? Are there any compression techniques used (distillation, quantization etc) to reduce the model size?

I found some description here (https://docs.aws.amazon.com/sagemaker/latest/dg/neo.html) regarding quantization but it's not clear how it could be used.

Thanks very much for any insight.

Upvotes: 2

Views: 374

Answers (1)

Olivier Cruchant

Reputation: 4037

Neo optimizes inference through compilation, which is different from, and often orthogonal to, compression:

  • compilation makes inference faster and lighter by specializing the prediction application, notably by: (1) changing the environment in which the model runs, in particular replacing training frameworks with the minimal set of math libraries needed for inference; (2) optimizing the model graph to be prediction-only and fusing operators where possible; (3) specializing the runtime to make the best use of the specific hardware and instruction sets available on a given target machine. Compilation is not supposed to change the model math, so it does not change the model's footprint on disk

  • compression makes inference faster by removing model weights or making them smaller (quantization). Weights can be removed by pruning (dropping weights that have little influence on the results) or by distillation (training a small model to mimic a big model).
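To make the quantization idea concrete, here is a minimal NumPy sketch of symmetric post-training int8 quantization — this is illustrative only, not Neo's implementation. A float32 weight tensor is mapped onto the int8 range, shrinking storage 4x at the cost of a small rounding error:

```python
import numpy as np

# Toy float32 "weight tensor" standing in for a trained layer's weights.
rng = np.random.default_rng(0)
weights = rng.standard_normal(1024).astype(np.float32)

# Symmetric int8 quantization: map [-max|w|, +max|w|] onto [-127, 127].
scale = np.abs(weights).max() / 127.0
q_weights = np.round(weights / scale).astype(np.int8)

# Dequantize to approximate the original values at inference time.
deq_weights = q_weights.astype(np.float32) * scale

size_fp32 = weights.nbytes    # 4 bytes per weight
size_int8 = q_weights.nbytes  # 1 byte per weight
max_error = np.abs(weights - deq_weights).max()

print(f"fp32: {size_fp32} bytes, int8: {size_int8} bytes")
print(f"max absolute quantization error: {max_error:.4f}")
```

The rounding error per weight is bounded by half the scale, which is why quantization usually costs little accuracy for well-behaved weight distributions.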

At the time of this writing, SageMaker Neo is a managed compilation service. That being said, compilation and compression can be combined, and you can prune or distill your network before feeding it to Neo.
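For reference, here is a sketch of the request you would pass to the `create_compilation_job` API (via boto3) to compile a model with Neo — the job name, role ARN, S3 paths, input shape, and target device below are placeholders you would replace with your own values:

```python
# Hypothetical values throughout: job name, role ARN, S3 paths, input shape,
# and target device are placeholders, not real resources.
compilation_job = {
    "CompilationJobName": "my-neo-job",
    "RoleArn": "arn:aws:iam::123456789012:role/MySageMakerRole",
    "InputConfig": {
        "S3Uri": "s3://my-bucket/model/model.tar.gz",
        # Shape of the model's input tensor, keyed by input name.
        "DataInputConfig": '{"input0": [1, 3, 224, 224]}',
        "Framework": "PYTORCH",
    },
    "OutputConfig": {
        "S3OutputLocation": "s3://my-bucket/compiled/",
        "TargetDevice": "ml_c5",
    },
    "StoppingCondition": {"MaxRuntimeInSeconds": 900},
}

# With boto3 installed and AWS credentials configured, you would submit it as:
#   import boto3
#   boto3.client("sagemaker").create_compilation_job(**compilation_job)
print(compilation_job["OutputConfig"]["TargetDevice"])
```

If you prune or distill your model first, it is the resulting smaller artifact that you point `S3Uri` at — Neo then compiles whatever model it is given.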

SageMaker Neo covers a large grid of hardware targets and model architectures, and consequently leverages numerous backends and optimizations. Neo internals are publicly documented in many places.

Upvotes: 1
