shyu_ee

Reputation: 11

Why does a TFLite model run slowly in multi-batch inference?

I converted a tiny BERT model to TFLite and ran inference with the TensorFlow Lite C++ API.

With batch size = 1, TensorFlow Lite averages 0.6 ms per inference while TensorFlow averages 1 ms (with the default number of threads); with batch size = 10, TensorFlow Lite averages 5 ms while TensorFlow averages 3 ms.

It seems TensorFlow Lite gets no speed-up from multi-threading, even though I tried SetNumThreads(4).

SetNumThreads(4) and SetNumThreads(1) give the same runtime, although CPU usage rises from 100% to 200%.

Is this normal performance for TFLite on an x86 desktop?

Here is the relevant part of my custom TFLite C++ code:

class Session {
 public:
  Session() {
    model_ = nullptr;
    interpreter_ = nullptr;
  }

  bool Open(const std::string &saved_model) {
    // Load the .tflite flatbuffer and build the interpreter.
    model_ = tflite::FlatBufferModel::BuildFromFile(saved_model.c_str());
    if (!model_) {
      return false;
    }

    tflite::InterpreterBuilder(*model_, resolver_)(&interpreter_);
    if (!interpreter_) {
      return false;
    }
    interpreter_->SetNumThreads(4);
    return true;
  }

  bool Run(std::vector<int> &dims, int32_t *tok_id, int32_t *msk_id,
           int32_t *seg_id, float *output) const {
    int tok_index = interpreter_->inputs()[2];
    int msk_index = interpreter_->inputs()[1];
    int seg_index = interpreter_->inputs()[0];

    // Resize all three inputs to the current batch shape {batch, seq_len}.
    interpreter_->ResizeInputTensor(tok_index, dims);
    interpreter_->ResizeInputTensor(msk_index, dims);
    interpreter_->ResizeInputTensor(seg_index, dims);

    // Removing AllocateTensors() did not change the runtime.
    if (interpreter_->AllocateTensors() != kTfLiteOk)
      return false;

    // Copy token, mask, and segment ids into the input tensors.
    int32_t bytes = dims[0] * dims[1] * sizeof(int32_t);
    int32_t *tok_tensor = interpreter_->typed_tensor<int32_t>(tok_index);
    memcpy(tok_tensor, tok_id, bytes);
    int32_t *msk_tensor = interpreter_->typed_tensor<int32_t>(msk_index);
    memcpy(msk_tensor, msk_id, bytes);
    int32_t *seg_tensor = interpreter_->typed_tensor<int32_t>(seg_index);
    memcpy(seg_tensor, seg_id, bytes);

    if (interpreter_->Invoke() != kTfLiteOk)
      return false;

    // Copy out one float per batch element.
    bytes = dims[0] * sizeof(float);
    float *result = interpreter_->typed_output_tensor<float>(0);
    memcpy(output, result, bytes);
    return true;
  }

 private:
  std::unique_ptr<tflite::FlatBufferModel> model_;
  std::unique_ptr<tflite::Interpreter> interpreter_;
  tflite::ops::builtin::BuiltinOpResolver resolver_;
};
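
The timings above were measured roughly like the sketch below, which averages repeated calls to Run() on this Session class. The model path, sequence length, iteration count, and dummy input values are placeholders, not the exact values from my benchmark.

#include <chrono>
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
  Session session;
  if (!session.Open("tinybert.tflite")) return 1;  // placeholder model path

  const int batch = 10, seq_len = 128;             // assumed sequence length
  std::vector<int> dims = {batch, seq_len};
  std::vector<int32_t> tok(batch * seq_len, 0);    // dummy token ids
  std::vector<int32_t> msk(batch * seq_len, 1);    // dummy attention mask
  std::vector<int32_t> seg(batch * seq_len, 0);    // dummy segment ids
  std::vector<float> out(batch);

  const int iters = 100;
  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < iters; ++i) {
    session.Run(dims, tok.data(), msk.data(), seg.data(), out.data());
  }
  auto end = std::chrono::steady_clock::now();
  double ms = std::chrono::duration<double, std::milli>(end - start).count();
  std::printf("average runtime: %.3f ms\n", ms / iters);
  return 0;
}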

Upvotes: 1

Views: 1944

Answers (1)

jdduke

Reputation: 159

A few things:

  • TensorFlow Lite is optimized for mobile deployment, which in practice typically means ARM-based devices. While we are actively working on improvements to x86 performance for TFLite, inference latency improvements may not be as consistent as they are on ARM CPUs.
  • TensorFlow Lite is optimized for mobile-sized workloads, and may not be as optimized for larger batch sizes or multi-threading as standard TensorFlow.

If you're using a floating point model, the recently announced XNNPACK backend should provide some improvements for larger batch sizes and x86 devices. See also the blog post announcing support in TF 2.3 and the latest nightly builds. For now, you'll have to opt in at build time, but we hope to enable this backend by default for more devices and models in the near future.
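
For reference, once you've built with XNNPACK support, applying the delegate explicitly to an interpreter looks roughly like the sketch below. The thread count and error handling are illustrative, and you'd call this right after building the interpreter (e.g. inside your Open()).

// Minimal sketch: attach the XNNPACK delegate to an existing interpreter.
#include "tensorflow/lite/delegates/xnnpack/xnnpack_delegate.h"

bool ApplyXnnpack(tflite::Interpreter *interpreter, int num_threads) {
  TfLiteXNNPackDelegateOptions options = TfLiteXNNPackDelegateOptionsDefault();
  options.num_threads = num_threads;  // illustrative; tune for your machine

  TfLiteDelegate *delegate = TfLiteXNNPackDelegateCreate(&options);
  if (interpreter->ModifyGraphWithDelegate(delegate) != kTfLiteOk) {
    // Delegation failed; the interpreter falls back to the default CPU kernels.
    return false;
  }
  // Note: the delegate must outlive the interpreter. Release it with
  // TfLiteXNNPackDelegateDelete(delegate) only after the interpreter is destroyed.
  return true;
}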

Upvotes: 2
