Reputation: 11
I converted a tiny BERT model to TFLite and ran inference with the TensorFlow Lite C++ API.
With batch size = 1, TensorFlow Lite averages 0.6 ms per inference while TensorFlow averages 1 ms (with the default number of threads); with batch size = 10, TensorFlow Lite averages 5 ms while TensorFlow averages 3 ms.
It seems TensorFlow Lite gets no speedup from multithreading: SetNumThreads(4) gives the same runtime as SetNumThreads(1), even though CPU usage rises from 100% to 200%.
Is this normal performance for TFLite on an x86 desktop?
Here is the relevant part of my TFLite C++ code:
#include <cstring>  // memcpy
#include <memory>
#include <string>
#include <vector>

#include "tensorflow/lite/interpreter.h"
#include "tensorflow/lite/kernels/register.h"
#include "tensorflow/lite/model.h"

class Session {
 public:
  Session() = default;  // model_ and interpreter_ stay null until Open() succeeds

  bool Open(const std::string &saved_model) {
    // Load the .tflite flatbuffer and build an interpreter over it.
    model_ = tflite::FlatBufferModel::BuildFromFile(saved_model.c_str());
    if (!model_) {
      return false;
    }
    tflite::InterpreterBuilder(*model_, resolver_)(&interpreter_);
    if (!interpreter_) {
      return false;
    }
    interpreter_->SetNumThreads(4);
    return true;
  }

  bool Run(std::vector<int> &dims, int32_t *tok_id, int32_t *msk_id, int32_t *seg_id, float *output) const {
    // Input order is model specific: token ids, mask, segment ids.
    int tok_index = interpreter_->inputs()[2];
    int msk_index = interpreter_->inputs()[1];
    int seg_index = interpreter_->inputs()[0];

    // Resize the inputs to the current batch shape and reallocate buffers.
    interpreter_->ResizeInputTensor(tok_index, dims);
    interpreter_->ResizeInputTensor(msk_index, dims);
    interpreter_->ResizeInputTensor(seg_index, dims);
    if (interpreter_->AllocateTensors() != kTfLiteOk)  // removing AllocateTensors() did not change the runtime
      return false;

    // Copy the int32 inputs into the interpreter's tensor buffers.
    size_t bytes = dims[0] * dims[1] * sizeof(int32_t);
    int32_t *tok_tensor = interpreter_->typed_tensor<int32_t>(tok_index);
    memcpy(tok_tensor, tok_id, bytes);
    int32_t *msk_tensor = interpreter_->typed_tensor<int32_t>(msk_index);
    memcpy(msk_tensor, msk_id, bytes);
    int32_t *seg_tensor = interpreter_->typed_tensor<int32_t>(seg_index);
    memcpy(seg_tensor, seg_id, bytes);

    if (interpreter_->Invoke() != kTfLiteOk)
      return false;

    // Copy one float per batch element out of the first output tensor.
    bytes = dims[0] * sizeof(float);
    float *result = interpreter_->typed_output_tensor<float>(0);
    memcpy(output, result, bytes);
    return true;
  }

 private:
  std::unique_ptr<tflite::FlatBufferModel> model_;
  std::unique_ptr<tflite::Interpreter> interpreter_;
  tflite::ops::builtin::BuiltinOpResolver resolver_;
};
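For reference, I drive the class roughly like this. The model path, batch size, sequence length, and input data below are illustrative placeholders, not my real benchmark harness:

#include <cstdint>
#include <vector>

int main() {
  Session session;                                    // the class above
  if (!session.Open("tiny_bert.tflite")) return 1;    // hypothetical model path
  const int batch = 10, seq_len = 128;                // illustrative shapes
  std::vector<int> dims = {batch, seq_len};
  std::vector<int32_t> tok(batch * seq_len, 0);       // dummy token ids
  std::vector<int32_t> msk(batch * seq_len, 1);       // dummy attention mask
  std::vector<int32_t> seg(batch * seq_len, 0);       // dummy segment ids
  std::vector<float> out(batch);                      // one score per batch element
  return session.Run(dims, tok.data(), msk.data(), seg.data(), out.data()) ? 0 : 1;
}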
Upvotes: 1
Views: 1944
Reputation: 159
A few things:
If you're using a floating-point model, the recently announced XNNPACK backend should provide some improvements for larger batch sizes and x86 devices. See also the blog post announcing support in TF 2.3 and the latest nightly builds. For now, you'll have to opt in at build time, but we hope to enable this backend by default for more devices and models in the near future.
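For reference, here is a minimal sketch of attaching the delegate explicitly at runtime, based on the delegate's public header; adapt the options and error handling to your Session class, and attach it before AllocateTensors():

#include "tensorflow/lite/delegates/xnnpack/xnnpack_delegate.h"

// Inside Open(), right after the interpreter is built:
TfLiteXNNPackDelegateOptions options = TfLiteXNNPackDelegateOptionsDefault();
options.num_threads = 4;  // XNNPACK uses its own thread pool
TfLiteDelegate *xnnpack_delegate = TfLiteXNNPackDelegateCreate(&options);
if (interpreter_->ModifyGraphWithDelegate(xnnpack_delegate) != kTfLiteOk) {
  // Delegation failed; the interpreter falls back to the default CPU kernels.
}
// Keep xnnpack_delegate alive for the interpreter's lifetime and free it with
// TfLiteXNNPackDelegateDelete() after the interpreter is destroyed.

Alternatively, the build-time opt-in mentioned above is, if I recall the flag correctly, the Bazel define --define tflite_with_xnnpack=true, which applies XNNPACK to supported floating-point models automatically.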
Upvotes: 2