Reputation: 765
I have written a full TF inference pipeline using the C backend. I am currently working on hardware with multiple GPUs (x8). It works well on CPU, but not on GPU, because I am not able to select the devices correctly.
The workflow is the following: a single thread sets up the session from a saved model
TF_LoadSessionFromSavedModel(...)
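For reference, here is a minimal sketch of that setup step; the export directory and the "serve" tag are illustrative, not my exact values:
// Sketch: load the SavedModel once from the setup thread.
TF_Status *status = TF_NewStatus();
TF_Graph *graph = TF_NewGraph();
TF_SessionOptions *session_opts = TF_NewSessionOptions();
const char *tags[] = {"serve"};
TF_Session *session = TF_LoadSessionFromSavedModel(
    session_opts, nullptr /* run_options */, "/path/to/saved_model",
    tags, 1, graph, nullptr /* meta_graph_def */, status);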
Then, a thread from a pool executes the usual workflow for the C backend (set up input/output and run)
TF_NewTensor(...) // allocate input
TF_AllocateTensor(...) // allocate output
TF_SessionRun(....)
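For context, a minimal sketch of that per-run step; the operation names and the tensor shape are placeholders, not the ones from my real model:
// Sketch: one inference call from a worker thread (session, graph and status
// come from the setup step above; names and shapes are placeholders).
TF_Output input_op  = {TF_GraphOperationByName(graph, "input"),  0};
TF_Output output_op = {TF_GraphOperationByName(graph, "output"), 0};

int64_t dims[] = {1, 224, 224, 3};
TF_Tensor *input = TF_AllocateTensor(TF_FLOAT, dims, 4,
                                     1 * 224 * 224 * 3 * sizeof(float));
// ... fill TF_TensorData(input) with the actual input values ...

TF_Tensor *output = nullptr;
TF_SessionRun(session, nullptr /* run_options */,
              &input_op, &input, 1,
              &output_op, &output, 1,
              nullptr, 0, nullptr /* run_metadata */, status);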
Since I know which device I want to execute my code on, I am using the CUDA runtime API call cudaSetDevice; however, it has no effect (by default everything runs on device 0, as checked with nvidia-smi). If I force the device using CUDA_VISIBLE_DEVICES, I can effectively select another device ID; however, CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 combined with cudaSetDevice does not work.
I suspect TF forces the device internally; perhaps this could be controlled through TF_SetConfig, or through the run_options of TF_SessionRun. However, there is no documentation for the C backend, so if a TF wizard is around, I would appreciate advice on how to correctly set the device on which TF_SessionRun executes.
Upvotes: 2
Views: 945
Reputation: 765
I am bringing the answer to my own question, after a long exchange with a Google TF dev and a long day of coding. The starting point: it is currently impossible to do multi-GPU with the saved-model loader. To succeed, here is the to-do list:
Read the protobuf in C and get a TF_Graph*. Here is the code to do that (C++ using the TF C API):
#include <cstdio>
#include <cstdlib>

#include <tensorflow/c/c_api.h>

// Deallocator handed to TF_Buffer so TensorFlow can free the file contents.
inline void deallocate_buffer(void *data, size_t) {
  std::free(data);
}

// Read a whole file into a TF_Buffer (returns nullptr on failure).
inline TF_Buffer *read_buffer_from_file(const char *file) {
  const auto f = std::fopen(file, "rb");
  if (f == nullptr) {
    return nullptr;
  }

  std::fseek(f, 0, SEEK_END);
  const auto fsize = std::ftell(f);
  std::fseek(f, 0, SEEK_SET);

  if (fsize < 1) {
    std::fclose(f);
    return nullptr;
  }

  const auto data = std::malloc(fsize);
  std::fread(data, fsize, 1, f);
  std::fclose(f);

  TF_Buffer *buf = TF_NewBuffer();
  buf->data = data;
  buf->length = fsize;
  buf->data_deallocator = deallocate_buffer;
  return buf;
}

// Import a serialized GraphDef (frozen .pb file) into a new TF_Graph.
// check_status() is a small helper that reports/aborts when
// TF_GetCode(status) != TF_OK.
inline TF_Graph *load_graph_def(const char *file) {
  if (file == nullptr) {
    return nullptr;
  }

  TF_Buffer *buffer = read_buffer_from_file(file);
  if (buffer == nullptr) {
    return nullptr;
  }

  TF_Graph *graph = TF_NewGraph();
  TF_Status *status = TF_NewStatus();
  TF_ImportGraphDefOptions *opts = TF_NewImportGraphDefOptions();

  TF_GraphImportGraphDef(graph, buffer, opts, status);
  TF_DeleteImportGraphDefOptions(opts);
  TF_DeleteBuffer(buffer);

  check_status(status);
  TF_DeleteStatus(status);
  return graph;
}
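For example (the .pb path is illustrative):
// Load the frozen GraphDef once; the path is only an example.
TF_Graph *graph = load_graph_def("/path/to/frozen_graph.pb");
if (graph == nullptr) {
  // handle the read/import failure
}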
Convert the graph to a GraphDef and modify the device of execution for every node. I provide the main elements:
// serialize the loaded graph back to a GraphDef buffer
TF_Status *status = TF_NewStatus();
TF_Buffer *buffer = TF_NewBuffer();
TF_GraphToGraphDef(graph, buffer, status);
check_status(status);

// prepare the options for the new graph
std::string device = "/device:GPU:" + std::to_string(i); // i is the GPU index (an int)
TF_ImportGraphDefOptions *graph_options = TF_NewImportGraphDefOptions();
TF_ImportGraphDefOptionsSetDefaultDevice(graph_options, device.c_str());

// create the new graph pinned to the correct device
TF_Graph *ngraph = TF_NewGraph();
TF_GraphImportGraphDef(ngraph, buffer, graph_options, status);
check_status(status);
TF_DeleteImportGraphDefOptions(graph_options);
TF_DeleteBuffer(buffer);

// create the new session (session_opts is configured below with TF_SetConfig)
TF_Session *session = TF_NewSession(ngraph, session_opts, status);
check_status(status);
Important: a new session must be created for each GPU device; it is up to you to manage them with your favorite threading system, as sketched below.
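For what it's worth, a minimal sketch of how the per-GPU sessions could be managed; create_session_for_gpu() is a hypothetical wrapper around the graph-import / TF_NewSession code above, not part of the original answer:
// One session per GPU; each worker thread then uses the session that matches
// its assigned device. create_session_for_gpu() is a hypothetical wrapper
// around the graph-import / TF_NewSession code shown above.
const int num_gpus = 8;
std::vector<TF_Session*> sessions;   // requires #include <vector>
sessions.reserve(num_gpus);
for (int i = 0; i < num_gpus; ++i) {
  sessions.push_back(create_session_for_gpu(graph, i));
}
// a worker thread assigned to GPU k then calls TF_SessionRun(sessions[k], ...)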
Last, the session_opts passed to TF_NewSession must be set up correctly with a bit of hex-protobuf black magic (visible GPU IDs and soft placement):
// This cryptic buffer is the serialized ConfigProto generated in Python by:
//   import tensorflow as tf
//   gpu_options = tf.compat.v1.GPUOptions(per_process_gpu_memory_fraction=0.5,
//                                         allow_growth=True,
//                                         visible_device_list='0,1,2,3,4,5,6,7')
//   config = tf.compat.v1.ConfigProto(gpu_options=gpu_options,
//                                     allow_soft_placement=True)
//   serialized = config.SerializeToString()
//   print(list(map(hex, serialized)))
std::vector<uint8_t> config = {0x32, 0x1c, 0x9,  0x0,  0x0,  0x0,  0x0,  0x0,  0x0,  0xe0, 0x3f,
                               0x20, 0x1,  0x2a, 0xf,  0x30, 0x2c, 0x31, 0x2c, 0x32, 0x2c, 0x33,
                               0x2c, 0x34, 0x2c, 0x35, 0x2c, 0x36, 0x2c, 0x37, 0x38, 0x1};

TF_SessionOptions *session_opts = TF_NewSessionOptions();
TF_SetConfig(session_opts, config.data(), config.size(), status);
check_status(status);
I hope this will help people who have to run their TF in C.
Upvotes: 2