Timocafé

Reputation: 765

TensorFlow C API: selecting the GPU device

I have written a full TF inference pipeline using the C API. Currently, I am working on hardware with multiple GPUs (x8). It works well on CPU, but not really on GPU, because I am not able to select the devices correctly.

The workflow is the following: a single thread sets up the session from a saved model

TF_LoadSessionFromSavedModel(...)

Then, a thread from a pool executes the usual workflow for the C API (set up input/output and run; a fleshed-out sketch follows the calls below)

    TF_NewTensor(...)      // allocate input
    TF_AllocateTensor(...) // allocate output
    TF_SessionRun(...)
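
For reference, a minimal sketch of that per-thread step (the op names "input"/"output", the dtype, and the shape are placeholder assumptions; `graph` and `session` come from the load above):

    // Minimal per-thread run sketch; needs <tensorflow/c/c_api.h> and <cstdlib>.
    TF_Status *status = TF_NewStatus();

    const int64_t dims[2] = {1, 3};
    const size_t nbytes = 3 * sizeof(float);
    float *in_data = static_cast<float *>(std::malloc(nbytes));
    // ... fill in_data[0..2] with the actual input values ...
    TF_Tensor *input = TF_NewTensor(
        TF_FLOAT, dims, 2, in_data, nbytes,
        [](void *data, size_t, void *) { std::free(data); }, nullptr);

    TF_Output in_op = {TF_GraphOperationByName(graph, "input"), 0};
    TF_Output out_op = {TF_GraphOperationByName(graph, "output"), 0};
    TF_Tensor *output = nullptr; // allocated by TF_SessionRun itself

    TF_SessionRun(session, /*run_options=*/nullptr,
                  &in_op, &input, 1,   // inputs
                  &out_op, &output, 1, // outputs
                  nullptr, 0,          // target operations
                  /*run_metadata=*/nullptr, status);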

Currently, I know on which device I want to execute my code, so I am using the CUDA runtime API call cudaSetDevice; however, it has no influence (by default everything runs on device 0, checked with nvidia-smi). If I force the device using CUDA_VISIBLE_DEVICES I can effectively select another device ID; however, CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 combined with cudaSetDevice does not work.

I suspect TF forces the device internally; maybe this could be controlled through TF_SetConfig, or the run_options of TF_SessionRun. However, documentation for the C API is essentially nonexistent. So if a TF wizard is around, I would appreciate advice on how to correctly select the device on which TF_SessionRun executes.

Upvotes: 2

Views: 945

Answers (1)

Timocafé

Reputation: 765

I am answering my own question, after a long exchange with a Google TF dev and a long day of coding. The starting point: it is currently impossible to do multi-GPU with a saved model. To succeed, here is the to-do list:

  1. Convert the saved model to the old-fashioned frozen buffer (protobuf). The graph and the metadata will then live in a single file. The best way to make this conversion is Python; this post provides all the info. We obtain a single protobuf.
  2. Read the protobuf in C and get a TF_Graph*. Here is the C code to do that (it relies on a small check_status helper, sketched after the snippet):

    #include <cstdio>   // std::fopen, std::fread, std::fclose
    #include <cstdlib>  // std::malloc, std::free
    
    #include <tensorflow/c/c_api.h>
    
    // Deallocator handed to TF_Buffer so TF can free the data we malloc'd.
    inline void deallocate_buffer(void *data, size_t) {
        std::free(data);
    }
    
    // Read a whole file into a TF_Buffer (returns nullptr on any failure).
    inline TF_Buffer *read_buffer_from_file(const char *file) {
        const auto f = std::fopen(file, "rb");
        if (f == nullptr) {
            return nullptr;
        }
    
        std::fseek(f, 0, SEEK_END);
        const auto fsize = std::ftell(f);
        std::fseek(f, 0, SEEK_SET);
    
        if (fsize < 1) {
            std::fclose(f);
            return nullptr;
        }
    
        const auto data = std::malloc(fsize);
        if (data == nullptr || std::fread(data, fsize, 1, f) != 1) {
            std::free(data);
            std::fclose(f);
            return nullptr;
        }
        std::fclose(f);
    
        TF_Buffer *buf = TF_NewBuffer();
        buf->data = data;
        buf->length = fsize;
        buf->data_deallocator = deallocate_buffer;
    
        return buf;
    }
    
    // Import a frozen GraphDef (protobuf) into a fresh TF_Graph.
    inline TF_Graph *load_graph_def(const char *file) {
        if (file == nullptr) {
            return nullptr;
        }
    
        TF_Buffer *buffer = read_buffer_from_file(file);
        if (buffer == nullptr) {
            return nullptr;
        }
    
        TF_Graph *graph = TF_NewGraph();
        TF_Status *status = TF_NewStatus();
        TF_ImportGraphDefOptions *opts = TF_NewImportGraphDefOptions();
    
        TF_GraphImportGraphDef(graph, buffer, opts, status);
        TF_DeleteImportGraphDefOptions(opts);
        TF_DeleteBuffer(buffer);
    
        check_status(status);
        TF_DeleteStatus(status);
        return graph;
    }
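
All the snippets rely on a small check_status helper; a minimal version (to declare before load_graph_def) could be:

    // Minimal status check assumed by the snippets: abort on any non-OK
    // TF_Status, printing the message TensorFlow provides.
    inline void check_status(const TF_Status *status) {
        if (TF_GetCode(status) != TF_OK) {
            std::fprintf(stderr, "TF error: %s\n", TF_Message(status));
            std::abort();
        }
    }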
    
  3. Convert the graph to a GraphDef and modify the device of execution for every node. Here are the main elements:

    // serialize the graph loaded in step 2 into a GraphDef buffer
    TF_Buffer *buffer = TF_NewBuffer();
    TF_GraphToGraphDef(graph, buffer, status);
    check_status(status);
    // prepare the options for the new graph
    std::string device = "/device:GPU:" + std::to_string(i); // i is the GPU index
    TF_ImportGraphDefOptions *graph_options = TF_NewImportGraphDefOptions();
    TF_ImportGraphDefOptionsSetDefaultDevice(graph_options, device.c_str());
    // create the new graph with the correct device
    TF_Graph *ngraph = TF_NewGraph();
    TF_GraphImportGraphDef(ngraph, buffer, graph_options, status);
    check_status(status);
    TF_DeleteImportGraphDefOptions(graph_options);
    TF_DeleteBuffer(buffer);
    // create the new session
    TF_Session *session = TF_NewSession(ngraph, session_opts, status);
    check_status(status);
    

    Important: a new graph and a new session must be created for each GPU device; it is up to you to manage them with your favorite threading system. A minimal sketch of the per-GPU loop follows.
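
    Putting steps 2 and 3 together, a minimal sketch of the per-GPU setup loop (gpu_count and the session_opts from step 4 below are assumed, as is the hypothetical file name; needs <string> and <vector>):

    std::vector<TF_Session *> sessions;
    TF_Graph *graph = load_graph_def("frozen_model.pb"); // hypothetical path, from step 1
    TF_Status *status = TF_NewStatus();

    // serialize once, import once per device
    TF_Buffer *graphdef = TF_NewBuffer();
    TF_GraphToGraphDef(graph, graphdef, status);
    check_status(status);

    for (int i = 0; i < gpu_count; ++i) {
        const std::string device = "/device:GPU:" + std::to_string(i);
        TF_ImportGraphDefOptions *opts = TF_NewImportGraphDefOptions();
        TF_ImportGraphDefOptionsSetDefaultDevice(opts, device.c_str());

        TF_Graph *ngraph = TF_NewGraph();
        TF_GraphImportGraphDef(ngraph, graphdef, opts, status);
        check_status(status);
        TF_DeleteImportGraphDefOptions(opts);

        sessions.push_back(TF_NewSession(ngraph, session_opts, status));
        check_status(status);
    }
    TF_DeleteBuffer(graphdef);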

  4. Last, the session_opts for TF_NewSession must be set up correctly, with a bit of hex-protobuf black magic (GPU IDs and soft placement):

    // this cryptic buffer is generated in Python by:
    //   import tensorflow as tf
    //   gpu_options = tf.compat.v1.GPUOptions(per_process_gpu_memory_fraction=0.5,
    //                                         allow_growth=True,
    //                                         visible_device_list='0,1,2,3,4,5,6,7')
    //   config = tf.compat.v1.ConfigProto(gpu_options=gpu_options,
    //                                     allow_soft_placement=True)
    //   serialized = config.SerializeToString()
    //   print(list(map(hex, serialized)))
    
    std::vector<uint8_t> config = {0x32, 0x1c, 0x9,  0x0,  0x0,  0x0,  0x0,  0x0,  0x0,  0xe0, 0x3f,
                                   0x20, 0x1,  0x2a, 0xf,  0x30, 0x2c, 0x31, 0x2c, 0x32, 0x2c, 0x33,
                                   0x2c, 0x34, 0x2c, 0x35, 0x2c, 0x36, 0x2c, 0x37, 0x38, 0x1};
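    // decoded from the protobuf wire format, the bytes above mean:
    //   gpu_options { per_process_gpu_memory_fraction: 0.5, allow_growth: true,
    //                 visible_device_list: "0,1,2,3,4,5,6,7" }
    //   allow_soft_placement: true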
    
    TF_SessionOptions *session_opts = TF_NewSessionOptions();
    TF_SetConfig(session_opts, config.data(), config.size(), status);
    check_status(status);
    

I hope this will help people who have to run their TF in C.

Upvotes: 2
