Reputation: 363
Does SageMaker Neo (SageMaker compilation job) use any techniques for model optimization? Are there any compression techniques used (distillation, quantization etc) to reduce the model size?
I found some description here (https://docs.aws.amazon.com/sagemaker/latest/dg/neo.html) regarding quantization but it's not clear how it could be used.
Thanks very much for any insight.
Upvotes: 2
Views: 374
Reputation: 4037
Neo optimizes inference using compilation, which is different from, and often orthogonal to, compression.
Compilation makes inference faster and lighter by specializing the prediction application, notably by: (1) changing the environment in which the model runs, in particular replacing training frameworks with the minimal set of math libraries actually needed, (2) optimizing the model graph for prediction only and fusing operators that can be grouped together, (3) specializing the runtime to make the best use of the specific hardware and instruction sets available on a given target machine. Compilation is not supposed to change the model math, so it does not change the model's footprint on disk.
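To make this concrete, here is a minimal sketch of graph compilation with Apache TVM, one of the compilers Neo builds on (mentioned further down); the model file, input name, shape and target are hypothetical placeholders:

    # Compile an ONNX model with Apache TVM: graph-level optimization
    # (operator fusion, constant folding, layout transforms) followed by
    # target-specific code generation.
    import onnx
    import tvm
    from tvm import relay

    onnx_model = onnx.load("model.onnx")              # hypothetical model file
    shape_dict = {"input": (1, 3, 224, 224)}          # assumed input name and shape
    mod, params = relay.frontend.from_onnx(onnx_model, shape_dict)

    target = "llvm -mcpu=skylake-avx512"              # example CPU target
    with tvm.transform.PassContext(opt_level=3):
        lib = relay.build(mod, target=target, params=params)

    lib.export_library("compiled_model.so")           # self-contained, framework-free artifact

Note that nothing in this flow touches the weights themselves; only the surrounding graph and runtime are specialized.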
Compression makes inference faster by removing model weights or by making them smaller (quantization). Weights can be removed by pruning (dropping weights that have little influence on the results) or by distillation (training a small model to mimic a big one).
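For comparison, here is a minimal sketch of two compression techniques in PyTorch, applied independently of Neo; the model is a toy placeholder:

    import torch
    import torch.nn as nn
    import torch.nn.utils.prune as prune

    model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

    # Pruning: zero out the 30% of weights with the smallest magnitude
    # in each Linear layer, then make the pruning permanent.
    for module in model.modules():
        if isinstance(module, nn.Linear):
            prune.l1_unstructured(module, name="weight", amount=0.3)
            prune.remove(module, "weight")

    # Post-training dynamic quantization: store Linear weights as int8
    # instead of float32, shrinking the model and speeding up CPU inference.
    quantized_model = torch.quantization.quantize_dynamic(
        model, {nn.Linear}, dtype=torch.qint8
    )

Unlike compilation, these steps do change the model math (and its footprint on disk), which is why they usually come with an accuracy check.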
At the time of this writing, SageMaker Neo is a managed compilation service. That being said, compilation and compression can be combined, and you can prune or distill your network before feeding it to Neo.
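For example, a pruned or distilled model exported to S3 can be submitted to Neo like any other model. Here is a minimal sketch using boto3, where the bucket, role, framework, input shape and target are all hypothetical placeholders:

    import boto3

    sm = boto3.client("sagemaker")

    sm.create_compilation_job(
        CompilationJobName="my-neo-job",                         # hypothetical job name
        RoleArn="arn:aws:iam::123456789012:role/SageMakerRole",  # hypothetical role
        InputConfig={
            "S3Uri": "s3://my-bucket/model/model.tar.gz",        # trained (possibly pruned/distilled) model
            "DataInputConfig": '{"input": [1, 3, 224, 224]}',    # assumed input name and shape
            "Framework": "PYTORCH",
        },
        OutputConfig={
            "S3OutputLocation": "s3://my-bucket/compiled/",
            "TargetDevice": "ml_c5",                             # example target
        },
        StoppingCondition={"MaxRuntimeInSeconds": 900},
    )

Neo writes the compiled artifact to the output location, ready to be deployed on the chosen target.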
SageMaker Neo covers a wide range of hardware targets and model architectures, and consequently leverages numerous backends and optimizations. Neo internals are publicly documented in many places:
According to this blog, Neo uses Treelite for tree model optimization (Treelite: toolbox for decision tree deployment, Cho and Li)
According to its landing page, Neo also uses Apache TVM. TVM is the leading open-source deep learning compiler, developed by Tianqi Chen and the DMLC community (which also co-authored XGBoost and MXNet). TVM's tricks are abundantly documented in TVM: An Automated End-to-End Optimizing Compiler for Deep Learning (Chen et al.)
According to this blog, Neo also sometimes leverages NVIDIA TensorRT, the official inference optimization stack from NVIDIA
Neo also uses a number of Amazon-developed optimizations:
Upvotes: 1