Reputation: 355
Usually after I trained my model, I would use the same GPU to do so.
However, do we still need a GPU instance for inference if I were to want to serve it online as a service? Or would a CPU instance suffice?
Thanks.
Upvotes: 1
Views: 1325
Reputation: 1123
In many cases, the CPU should be enough. But if it isn't, you can further optimize the inference with some toolkits such as OpenVINO. OpenVINO is optimized for Intel hardware but it should work with any CPU. It optimizes your model by converting to Intermediate Representation (IR), performing graph pruning and fusing some operations into others while preserving accuracy. Then it uses vectorization in runtime.
Here are some performance benchmarks for various models and CPUs.
It's rather straightforward to convert the Tensorflow model to OpenVINO unless you have fancy custom layers. The full tutorial on how to do it can be found here. Some snippets below.
Install OpenVINO
The easiest way to do it is using PIP. Alternatively, you can use this tool to find the best way in your case.
pip install openvino-dev[tensorflow2]
Use Model Optimizer to convert SavedModel model
The Model Optimizer is a command-line tool that comes from OpenVINO Development Package. It converts the Tensorflow model to IR, which is a default format for OpenVINO. You can also try the precision of FP16, which should give you better performance without a significant accuracy drop (just change data_type). Run in the command line:
mo --saved_model_dir "model" --input_shape "[1, 3, 224, 224]" --data_type FP32 --output_dir "model_ir"
Run the inference
The converted model can be loaded by the runtime and compiled for a specific device e.g. CPU or GPU (integrated into your CPU like Intel HD Graphics). If you don't know what is the best choice for you, just use AUTO.
# Load the network
ie = Core()
model_ir = ie.read_model(model="model_ir/model.xml")
compiled_model_ir = ie.compile_model(model=model_ir, device_name="CPU")
# Get output layer
output_layer_ir = compiled_model_ir.output(0)
# Run inference on the input image
result = compiled_model_ir([input_image])[output_layer_ir]
Disclaimer: I work on OpenVINO.
Upvotes: 0
Reputation: 3358
The device would be cleared when you export a model. Here is the unit test for this feature: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/saved_model/saved_model_test.py#L564
Copy from comment: GPU is fast when processing a large batch. When making inference for a single input, CPU is fast enough.
Upvotes: 3