Timocafé

Reputation: 765

TensorFlow C API: selecting the GPU device

I have written a full TF inference pipeline using the C API. Currently, I am working on hardware with multiple GPUs (x8). It works well on CPU, but not really on GPU, because I am not able to select the devices correctly.

The workflow is the following: a single thread sets up the session from a saved model

TF_LoadSessionFromSavedModel(...)

Then, a thread from a pool executes the usual workflow for the C API (set up input/output and run; a fleshed-out sketch follows the calls below)

    TF_NewTensor(...)      // allocate input
    TF_AllocateTensor(...) // allocate output
    TF_SessionRun(...)
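
For reference, a minimal sketch of that per-thread step (the op names "input"/"output", the dtype, and the shape are placeholder assumptions; `graph` and `session` come from the load above):

    // Minimal per-thread run sketch; needs <tensorflow/c/c_api.h> and <cstdlib>.
    TF_Status *status = TF_NewStatus();

    const int64_t dims[2] = {1, 3};
    const size_t nbytes = 3 * sizeof(float);
    float *in_data = static_cast<float *>(std::malloc(nbytes));
    // ... fill in_data[0..2] with the actual input values ...
    TF_Tensor *input = TF_NewTensor(
        TF_FLOAT, dims, 2, in_data, nbytes,
        [](void *data, size_t, void *) { std::free(data); }, nullptr);

    TF_Output in_op = {TF_GraphOperationByName(graph, "input"), 0};
    TF_Output out_op = {TF_GraphOperationByName(graph, "output"), 0};
    TF_Tensor *output = nullptr; // allocated by TF_SessionRun itself

    TF_SessionRun(session, /*run_options=*/nullptr,
                  &in_op, &input, 1,   // inputs
                  &out_op, &output, 1, // outputs
                  nullptr, 0,          // target operations
                  /*run_metadata=*/nullptr, status);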

Currently, I know on which device I want to execute my code, so I am using the CUDA runtime API call cudaSetDevice; however, it has no influence (by default everything runs on device 0, checked with nvidia-smi). If I force the device using CUDA_VISIBLE_DEVICES I can effectively select another device ID; however, CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 combined with cudaSetDevice does not work.

I suspect TF forces the device internally; maybe this could be controlled through TF_SetConfig, or the run_options of TF_SessionRun. However, documentation for the C API is essentially nonexistent. So if a TF wizard is around, I would appreciate advice on how to correctly select the device on which TF_SessionRun executes.

Upvotes: 2

Views: 945

Answers (1)

Timocafé

Reputation: 765

I am answering my own question, after a long exchange with a Google TF dev and a long day of coding. The starting point: it is currently impossible to do multi-GPU with a saved model. To succeed, here is the to-do list:

  1. Convert the saved model to the old-fashioned frozen buffer (protobuf). The graph and the metadata will then live in a single file. The best way to make this conversion is Python; this post provides all the info. We obtain a single protobuf.
  2. Read the protobuf in C and get a TF_Graph*. Here is the C code to do that (it relies on a small check_status helper, sketched after the snippet):

    #include <cstdio>   // std::fopen, std::fread, std::fclose
    #include <cstdlib>  // std::malloc, std::free
    
    #include <tensorflow/c/c_api.h>
    
    // Deallocator handed to TF_Buffer so TF can free the data we malloc'd.
    inline void deallocate_buffer(void *data, size_t) {
        std::free(data);
    }
    
    // Read a whole file into a TF_Buffer (returns nullptr on any failure).
    inline TF_Buffer *read_buffer_from_file(const char *file) {
        const auto f = std::fopen(file, "rb");
        if (f == nullptr) {
            return nullptr;
        }
    
        std::fseek(f, 0, SEEK_END);
        const auto fsize = std::ftell(f);
        std::fseek(f, 0, SEEK_SET);
    
        if (fsize < 1) {
            std::fclose(f);
            return nullptr;
        }
    
        const auto data = std::malloc(fsize);
        if (data == nullptr || std::fread(data, fsize, 1, f) != 1) {
            std::free(data);
            std::fclose(f);
            return nullptr;
        }
        std::fclose(f);
    
        TF_Buffer *buf = TF_NewBuffer();
        buf->data = data;
        buf->length = fsize;
        buf->data_deallocator = deallocate_buffer;
    
        return buf;
    }
    
    // Import a frozen GraphDef (protobuf) into a fresh TF_Graph.
    inline TF_Graph *load_graph_def(const char *file) {
        if (file == nullptr) {
            return nullptr;
        }
    
        TF_Buffer *buffer = read_buffer_from_file(file);
        if (buffer == nullptr) {
            return nullptr;
        }
    
        TF_Graph *graph = TF_NewGraph();
        TF_Status *status = TF_NewStatus();
        TF_ImportGraphDefOptions *opts = TF_NewImportGraphDefOptions();
    
        TF_GraphImportGraphDef(graph, buffer, opts, status);
        TF_DeleteImportGraphDefOptions(opts);
        TF_DeleteBuffer(buffer);
    
        check_status(status);
        TF_DeleteStatus(status);
        return graph;
    }
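
All the snippets rely on a small check_status helper; a minimal version (to declare before load_graph_def) could be:

    // Minimal status check assumed by the snippets: abort on any non-OK
    // TF_Status, printing the message TensorFlow provides.
    inline void check_status(const TF_Status *status) {
        if (TF_GetCode(status) != TF_OK) {
            std::fprintf(stderr, "TF error: %s\n", TF_Message(status));
            std::abort();
        }
    }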
    
  3. Convert the graph to a GraphDef and modify the device of execution for every node. Here are the main elements:

    // serialize the graph loaded in step 2 into a GraphDef buffer
    TF_Buffer *buffer = TF_NewBuffer();
    TF_GraphToGraphDef(graph, buffer, status);
    check_status(status);
    // prepare the options for the new graph
    std::string device = "/device:GPU:" + std::to_string(i); // i is the GPU index
    TF_ImportGraphDefOptions *graph_options = TF_NewImportGraphDefOptions();
    TF_ImportGraphDefOptionsSetDefaultDevice(graph_options, device.c_str());
    // create the new graph with the correct device
    TF_Graph *ngraph = TF_NewGraph();
    TF_GraphImportGraphDef(ngraph, buffer, graph_options, status);
    check_status(status);
    TF_DeleteImportGraphDefOptions(graph_options);
    TF_DeleteBuffer(buffer);
    // create the new session
    TF_Session *session = TF_NewSession(ngraph, session_opts, status);
    check_status(status);
    

    Important: a new graph and a new session must be created for each GPU device; it is up to you to manage them with your favorite threading system. A minimal sketch of the per-GPU loop follows.
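
    Putting steps 2 and 3 together, a minimal sketch of the per-GPU setup loop (gpu_count and the session_opts from step 4 below are assumed, as is the hypothetical file name; needs <string> and <vector>):

    std::vector<TF_Session *> sessions;
    TF_Graph *graph = load_graph_def("frozen_model.pb"); // hypothetical path, from step 1
    TF_Status *status = TF_NewStatus();

    // serialize once, import once per device
    TF_Buffer *graphdef = TF_NewBuffer();
    TF_GraphToGraphDef(graph, graphdef, status);
    check_status(status);

    for (int i = 0; i < gpu_count; ++i) {
        const std::string device = "/device:GPU:" + std::to_string(i);
        TF_ImportGraphDefOptions *opts = TF_NewImportGraphDefOptions();
        TF_ImportGraphDefOptionsSetDefaultDevice(opts, device.c_str());

        TF_Graph *ngraph = TF_NewGraph();
        TF_GraphImportGraphDef(ngraph, graphdef, opts, status);
        check_status(status);
        TF_DeleteImportGraphDefOptions(opts);

        sessions.push_back(TF_NewSession(ngraph, session_opts, status));
        check_status(status);
    }
    TF_DeleteBuffer(graphdef);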

  4. Last, the session_opts for TF_NewSession must be set up correctly, with a bit of hex-protobuf black magic (GPU IDs and soft placement):

    // this cryptic buffer is generated in Python by:
    //   import tensorflow as tf
    //   gpu_options = tf.compat.v1.GPUOptions(per_process_gpu_memory_fraction=0.5,
    //                                         allow_growth=True,
    //                                         visible_device_list='0,1,2,3,4,5,6,7')
    //   config = tf.compat.v1.ConfigProto(gpu_options=gpu_options,
    //                                     allow_soft_placement=True)
    //   serialized = config.SerializeToString()
    //   print(list(map(hex, serialized)))
    
    std::vector<uint8_t> config = {0x32, 0x1c, 0x9,  0x0,  0x0,  0x0,  0x0,  0x0,  0x0,  0xe0, 0x3f,
                                   0x20, 0x1,  0x2a, 0xf,  0x30, 0x2c, 0x31, 0x2c, 0x32, 0x2c, 0x33,
                                   0x2c, 0x34, 0x2c, 0x35, 0x2c, 0x36, 0x2c, 0x37, 0x38, 0x1};
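    // decoded from the protobuf wire format, the bytes above mean:
    //   gpu_options { per_process_gpu_memory_fraction: 0.5, allow_growth: true,
    //                 visible_device_list: "0,1,2,3,4,5,6,7" }
    //   allow_soft_placement: true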
    
    TF_SessionOptions *session_opts = TF_NewSessionOptions();
    TF_SetConfig(session_opts, config.data(), config.size(), status);
    check_status(status);
    

I hope this will help people who have to run their TF in C.

Upvotes: 2
