Reputation: 3460
My program first reads two images from the HDD (with the C++ file.read function) and then performs calculations on them on the GPU and CPU (a bunch of CUDA kernels mixed with simple CPU calculations). I have about 2000 pairs of images to process, and the reading and calculation times are almost equal. Is there a relatively simple way to parallelize the reading and the processing?
I know that with CUDA streams I can launch kernels asynchronously with respect to the host (CPU), but here my calculations are mixed and complicated. So maybe it is possible to use some kind of CPU multithreading?
What I want is:
readfromHDD(im-1);
readfromHDD(im-2);
for (int i = 3; i < 1998; i = i + 2) {
    readfromHDD(im-i);                    // functions inside the
    readfromHDD(im-(i+1));                // for loop are evaluated
    ProcessGPU&CPU(im-(i-2), im-(i-1));   // concurrently
    Synchronize_Reading_and_processing;
}
I don't think there is a need to post my actual code. I have never done multithreading before, so I don't know how it will work with CUDA kernels. Any hints are appreciated.
Thanks
Upvotes: 3
Views: 671
Reputation: 2250
I am very partial to pthreads, and to implementing an asynchronous wrapper on top of a reader that synchronizes when you request the next set of data.
This is the easiest method I can think of to implement. I've included something that should be easy to compile and that fully demonstrates an implementation. Good luck.
main.cpp demonstrates the use.
#include <iostream>

#include "Reader.h"
#include "Reader_Async_Wrapper.h"

using namespace std;

int main() {
    Reader *reader = new Reader("test");
    Reader_Async_Wrapper async_reader(reader);
    int img_index = 0;
    char *data = async_reader.get_data();
    while (((int*)data)[0] != -1) {
        cout << "processing image " << img_index << endl;
        sleep(2); // stands in for the GPU/CPU work
        cout << "processed image " << img_index++ << endl;
        delete[] data;
        data = async_reader.get_data();
    }
    delete[] data;          // release the end-of-data sentinel
    async_reader.finish();  // join the reader thread
    delete reader;
    return 0;
}
Reader.h is a simple, serially implemented file I/O class. Note the include guard: main.cpp includes this header both directly and via Reader_Async_Wrapper.h, so without a guard the class would be defined twice.
#ifndef READER_H
#define READER_H

#include <iostream>
#include <fstream>
#include <unistd.h>

using namespace std;

class Reader {
public:
    bool isFinished() { return finished; }

    Reader(string file_name) {
        open_file(file_name);
        finished = false;
        img_index = 0;
    }

    char* read_data() {
        cout << "Reading img: " << img_index << endl;
        sleep(1); // stands in for the actual disk read
        cout << "Read img: " << img_index++ << endl;
        if (img_index == 10) finished = true;
        return new char[1000];
    }

private:
    bool finished;
    int img_index;

    void open_file(string name) {
        // TODO
    }
};

#endif // READER_H
Reader_Async_Wrapper.h is a simple wrapper around Reader.h that makes it run asynchronously. State is initialised before the thread is created, and the barrier is hit twice per image: once to announce that image_data is ready, and once to confirm the consumer has taken the pointer, so the reader cannot overwrite it while the consumer is still fetching it.
#ifndef READER_ASYNC_WRAPPER_H
#define READER_ASYNC_WRAPPER_H

#include <pthread.h>

#include "Reader.h"

using namespace std;

class Reader_Async_Wrapper {
public:
    pthread_t thread;
    pthread_attr_t attr;
    Reader* reader;
    pthread_barrier_t barrier;

    Reader_Async_Wrapper(Reader* reader) : reader(reader) {
        // Initialise state before starting the thread, so it never sees garbage.
        finished = false;
        image_data = NULL;
        pthread_attr_init(&attr);
        pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);
        pthread_barrier_init(&barrier, NULL, 2);
        pthread_create(&thread, &attr, &Reader_Async_Wrapper::threadHelper, this);
    }

    void finish() {
        void *status;
        pthread_join(thread, &status);
        pthread_attr_destroy(&attr);
        pthread_barrier_destroy(&barrier);
    }

    char* get_data() {
        pthread_barrier_wait(&barrier);   // wait until image_data is ready
        char* data = image_data;
        pthread_barrier_wait(&barrier);   // let the reader start on the next image
        return data;
    }

    void clear_buffer(char* old_image) {
        delete[] old_image;
    }

private:
    char* image_data;
    bool finished;

    static void *threadHelper(void *contx) {
        return ((Reader_Async_Wrapper *)contx)->async_loop();
    }

    void *async_loop() {
        while (!finished) {
            if (reader->isFinished()) {
                // Hand back a sentinel so the consumer knows we are done.
                finished = true;
                image_data = new char[sizeof(int)];
                ((int*)image_data)[0] = -1;
            } else {
                image_data = reader->read_data();
            }
            pthread_barrier_wait(&barrier);   // data ready
            pthread_barrier_wait(&barrier);   // consumer has taken the pointer
        }
        pthread_exit(NULL);
        return NULL;
    }
};

#endif // READER_ASYNC_WRAPPER_H
I would suggest improving the handling associated with detecting the end of the file (assuming you're reading from a single long file). Otherwise, I think you can easily extend this to your application.
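As a hedged sketch of that suggestion (the FileChunkReader name, the fixed image_size, and the class layout are my assumptions, not part of the code above), read_data could pull fixed-size chunks from a real stream and flag end of file on a short read:

```cpp
#include <fstream>
#include <cstddef>

// Sketch only: detect end of file via a short read.
// FileChunkReader and image_size are illustrative assumptions.
class FileChunkReader {
public:
    FileChunkReader(const char* file_name, size_t image_size)
        : file(file_name, std::ios::binary), image_size(image_size), finished(false) {}

    bool isFinished() const { return finished; }

    // Returns a newly allocated chunk, or NULL once the file is exhausted.
    char* read_data() {
        char* buf = new char[image_size];
        file.read(buf, image_size);
        if (file.gcount() < static_cast<std::streamsize>(image_size)) {
            finished = true;   // short read: we hit end of file
            delete[] buf;
            return NULL;
        }
        return buf;
    }

private:
    std::ifstream file;
    size_t image_size;
    bool finished;
};
```

The sentinel in the wrapper could then be driven by isFinished() exactly as in the code above, without relying on a hard-coded image count.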
This method should be sufficient as long as you are not aiming to process many cases simultaneously, and you are mainly using it as a way to hide the latency of reading the file.
If you want to process many cases simultaneously, you can use the wrapper to wrap both the reading and the processing of a file. With respect to CUDA, I believe the threads should all share one CUDA context.
If you would like to be able to process in parallel on the GPU, there are a few things I would recommend:
- Create multiple copies of the wrapper class, one for each parallel instance you'd like.
- In the class constructor, allocate enough memory once for each async instance.
- Dedicate a CUDA stream to each thread so kernels can run in parallel.
- Do all memory copies and kernel executions on that thread.
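A rough sketch of that layout, with made-up names (PipelineInstance, process_pair) that are not in the answer's code, and with the CUDA calls left as comments so the sketch stays plain C++: each instance owns a buffer allocated once in its constructor, and in a real program would own one stream, with all copies and launches issued from its own thread.

```cpp
#include <cstddef>

// Illustrative sketch of one-pipeline-per-thread.
// PipelineInstance and process_pair are assumed names.
struct PipelineInstance {
    char* buffer;        // allocated once, reused for every image pair
    size_t buffer_size;
    // cudaStream_t stream;  // real program: one stream per instance

    explicit PipelineInstance(size_t size)
        : buffer(new char[size]), buffer_size(size) {
        // cudaStreamCreate(&stream);
    }

    ~PipelineInstance() {
        delete[] buffer;
        // cudaStreamDestroy(stream);
    }

    // All memcpys and kernel launches for one pair would be issued here,
    // on this instance's stream, from this instance's thread. The fill
    // below just stands in for that work.
    void process_pair() {
        for (size_t i = 0; i < buffer_size; ++i) buffer[i] = 1;
    }
};
```

Each worker thread would construct (or be handed) one PipelineInstance and loop over get_data() / process_pair(), so no buffer, stream, or pointer is ever shared between workers.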
Upvotes: 2
Reputation: 129524
There are probably thousands of different possible solutions. Here's what I would start with, to see how it works out:
Ingredients: a message queue, a read thread, and a processing thread.
Method:
1. Start the read thread and the processing thread.
2. The read thread reads two images at a time and sends them as one package on the message queue. Repeat until all images have been read.
3. The processing thread takes a package from the message queue and processes the two images. Repeat until all images have been processed.
4. Stop the threads and report the result (as applicable).
It may help to apply some "backpressure" to the message queue, so that when 4, 6 or 10 images are already loaded, the reader thread stops reading until there is space in the queue again.
The advantage of using a message queue in this way is that you get reasonable freedom between the threads, and the queue arranges all the synchronisation between them.
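A minimal sketch of such a queue with backpressure, using C++11 synchronization primitives (the BoundedQueue name and its interface are illustrative, not part of this answer): push() blocks once max_size items are waiting, which gives exactly the backpressure on the reader thread described above.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <cstddef>
#include <utility>

// Sketch: bounded thread-safe queue. push() blocks when full,
// pop() blocks when empty, so the two threads pace each other.
template <typename T>
class BoundedQueue {
public:
    explicit BoundedQueue(size_t max_size) : max_size_(max_size) {}

    void push(T item) {
        std::unique_lock<std::mutex> lock(m_);
        not_full_.wait(lock, [this] { return q_.size() < max_size_; });
        q_.push(std::move(item));
        not_empty_.notify_one();
    }

    T pop() {
        std::unique_lock<std::mutex> lock(m_);
        not_empty_.wait(lock, [this] { return !q_.empty(); });
        T item = std::move(q_.front());
        q_.pop();
        not_full_.notify_one();
        return item;
    }

private:
    std::queue<T> q_;
    size_t max_size_;
    std::mutex m_;
    std::condition_variable not_empty_, not_full_;
};
```

The read thread would push a package holding a pair of images, the processing thread would pop and process it, and an empty or sentinel package would signal the end of the image list.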
Upvotes: 2