Reputation: 3460
My program first reads two images from the HDD (with the C++ file.read function) and then performs calculations on them on the GPU and CPU (a bunch of CUDA kernels mixed with simple CPU calculations). I have about 2000 pairs of images to process, and the reading and calculation times are almost equal. Is there a relatively simple way to parallelize the reading and the processing?
I know that with CUDA streams I can launch kernels asynchronously with respect to the host (CPU), but here my calculations are mixed and complicated. So maybe it is possible to use some kind of CPU multithreading?
What I want is:
readfromHDD(im-1);
readfromHDD(im-2);
for (int i = 3; i < 1998; i = i + 2) {
    readfromHDD(im-i);                    // functions inside the
    readfromHDD(im-(i+1));                // for loop are evaluated
    ProcessGPU&CPU(im-(i-2), im-(i-1));   // concurrently
    Synchronize_Reading_and_processing;
}
I don't think there is a need to post my actual code. I have never done multithreading before, so I don't know how it will work with CUDA kernels. Any hints are appreciated.
Thanks
Upvotes: 3
Views: 671
Reputation: 2250
I am very partial to pthreads, and to implementing an asynchronous wrapper on top of a reader that synchronizes when you request the next set of data.
This is the easiest method I can think of to implement. I've included something that should be easy to compile and that fully demonstrates an implementation. Good luck.
main.cpp demonstrates the use.
#include <iostream>

#include "Reader.h"
#include "Reader_Async_Wrapper.h"

using namespace std;

int main() {
    Reader *reader = new Reader("test");
    Reader_Async_Wrapper async_reader(reader);
    int img_index = 0;
    char *data = async_reader.get_data();
    while (((int*)data)[0] != -1) {
        cout << "processing image " << img_index << endl;
        sleep(2); // stands in for the GPU/CPU work
        cout << "processed image " << img_index++ << endl;
        delete[] data;
        data = async_reader.get_data();
    }
    delete[] data;          // release the end-of-data sentinel
    async_reader.finish();  // join the reader thread
    delete reader;
    return 0;
}
Reader.h is a simple, serially implemented file I/O class. Note the include guard: main.cpp includes this header both directly and via Reader_Async_Wrapper.h, so without a guard the class would be defined twice.
#ifndef READER_H
#define READER_H

#include <iostream>
#include <fstream>
#include <unistd.h>

using namespace std;

class Reader {
public:
    bool isFinished() { return finished; }

    Reader(string file_name) {
        open_file(file_name);
        finished = false;
        img_index = 0;
    }

    char* read_data() {
        cout << "Reading img: " << img_index << endl;
        sleep(1); // stands in for the actual disk read
        cout << "Read img: " << img_index++ << endl;
        if (img_index == 10) finished = true;
        return new char[1000];
    }

private:
    bool finished;
    int img_index;

    void open_file(string name) {
        // TODO
    }
};

#endif // READER_H
Reader_Async_Wrapper.h is a simple wrapper around Reader.h that makes it run asynchronously. State is initialised before the thread is created, and the barrier is hit twice per image: once to announce that image_data is ready, and once to confirm the consumer has taken the pointer, so the reader cannot overwrite it while the consumer is still fetching it.
#ifndef READER_ASYNC_WRAPPER_H
#define READER_ASYNC_WRAPPER_H

#include <pthread.h>

#include "Reader.h"

using namespace std;

class Reader_Async_Wrapper {
public:
    pthread_t thread;
    pthread_attr_t attr;
    Reader* reader;
    pthread_barrier_t barrier;

    Reader_Async_Wrapper(Reader* reader) : reader(reader) {
        // Initialise state before starting the thread, so it never sees garbage.
        finished = false;
        image_data = NULL;
        pthread_attr_init(&attr);
        pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);
        pthread_barrier_init(&barrier, NULL, 2);
        pthread_create(&thread, &attr, &Reader_Async_Wrapper::threadHelper, this);
    }

    void finish() {
        void *status;
        pthread_join(thread, &status);
        pthread_attr_destroy(&attr);
        pthread_barrier_destroy(&barrier);
    }

    char* get_data() {
        pthread_barrier_wait(&barrier);   // wait until image_data is ready
        char* data = image_data;
        pthread_barrier_wait(&barrier);   // let the reader start on the next image
        return data;
    }

    void clear_buffer(char* old_image) {
        delete[] old_image;
    }

private:
    char* image_data;
    bool finished;

    static void *threadHelper(void *contx) {
        return ((Reader_Async_Wrapper *)contx)->async_loop();
    }

    void *async_loop() {
        while (!finished) {
            if (reader->isFinished()) {
                // Hand back a sentinel so the consumer knows we are done.
                finished = true;
                image_data = new char[sizeof(int)];
                ((int*)image_data)[0] = -1;
            } else {
                image_data = reader->read_data();
            }
            pthread_barrier_wait(&barrier);   // data ready
            pthread_barrier_wait(&barrier);   // consumer has taken the pointer
        }
        pthread_exit(NULL);
        return NULL;
    }
};

#endif // READER_ASYNC_WRAPPER_H
I would suggest improving the handling associated with detecting the end of the file (assuming you're reading from a single long file). Otherwise, I think you can easily extend this to your application.
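As a hedged sketch of that suggestion (the FileChunkReader name, the fixed image_size, and the class layout are my assumptions, not part of the code above), read_data could pull fixed-size chunks from a real stream and flag end of file on a short read:

```cpp
#include <fstream>
#include <cstddef>

// Sketch only: detect end of file via a short read.
// FileChunkReader and image_size are illustrative assumptions.
class FileChunkReader {
public:
    FileChunkReader(const char* file_name, size_t image_size)
        : file(file_name, std::ios::binary), image_size(image_size), finished(false) {}

    bool isFinished() const { return finished; }

    // Returns a newly allocated chunk, or NULL once the file is exhausted.
    char* read_data() {
        char* buf = new char[image_size];
        file.read(buf, image_size);
        if (file.gcount() < static_cast<std::streamsize>(image_size)) {
            finished = true;   // short read: we hit end of file
            delete[] buf;
            return NULL;
        }
        return buf;
    }

private:
    std::ifstream file;
    size_t image_size;
    bool finished;
};
```

The sentinel in the wrapper could then be driven by isFinished() exactly as in the code above, without relying on a hard-coded image count.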
This method should be sufficient as long as you are not aiming to process many cases simultaneously, and you are mainly using it as a way to hide the latency of reading the file.
If you want to process many cases simultaneously, you can use the wrapper to wrap both the reading and the processing of a file. With respect to CUDA, I believe the threads should all share one CUDA context.
If you would like to be able to process in parallel on the GPU, there are a few things I would recommend:
- Create multiple copies of the wrapper class, one for each parallel instance you'd like.
- In the class constructor, allocate enough memory once for each async instance.
- Dedicate a CUDA stream to each thread so kernels can run in parallel.
- Do all memory copies and kernel executions on that thread.
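A rough sketch of that layout, with made-up names (PipelineInstance, process_pair) that are not in the answer's code, and with the CUDA calls left as comments so the sketch stays plain C++: each instance owns a buffer allocated once in its constructor, and in a real program would own one stream, with all copies and launches issued from its own thread.

```cpp
#include <cstddef>

// Illustrative sketch of one-pipeline-per-thread.
// PipelineInstance and process_pair are assumed names.
struct PipelineInstance {
    char* buffer;        // allocated once, reused for every image pair
    size_t buffer_size;
    // cudaStream_t stream;  // real program: one stream per instance

    explicit PipelineInstance(size_t size)
        : buffer(new char[size]), buffer_size(size) {
        // cudaStreamCreate(&stream);
    }

    ~PipelineInstance() {
        delete[] buffer;
        // cudaStreamDestroy(stream);
    }

    // All memcpys and kernel launches for one pair would be issued here,
    // on this instance's stream, from this instance's thread. The fill
    // below just stands in for that work.
    void process_pair() {
        for (size_t i = 0; i < buffer_size; ++i) buffer[i] = 1;
    }
};
```

Each worker thread would construct (or be handed) one PipelineInstance and loop over get_data() / process_pair(), so no buffer, stream, or pointer is ever shared between workers.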
Upvotes: 2
Reputation: 129524
There are probably thousands of different possible solutions. Here's what I would start with, to see how it works out:
Ingredients: a message queue, a read thread, and a processing thread.
Method:
1. Start the read thread and the processing thread.
2. The read thread reads two images at a time and sends them as one package on the message queue. Repeat until all images have been read.
3. The processing thread takes a package from the message queue and processes the two images. Repeat until all images have been processed.
4. Stop the threads and report the result (as applicable).
It may help to apply some "backpressure" to the message queue, so that when 4, 6 or 10 images are already loaded, the reader thread stops reading until there is space in the queue again.
The advantage of using a message queue in this way is that you get reasonable freedom between the threads, and the queue arranges all the synchronisation between them.
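A minimal sketch of such a queue with backpressure, using C++11 synchronization primitives (the BoundedQueue name and its interface are illustrative, not part of this answer): push() blocks once max_size items are waiting, which gives exactly the backpressure on the reader thread described above.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <cstddef>
#include <utility>

// Sketch: bounded thread-safe queue. push() blocks when full,
// pop() blocks when empty, so the two threads pace each other.
template <typename T>
class BoundedQueue {
public:
    explicit BoundedQueue(size_t max_size) : max_size_(max_size) {}

    void push(T item) {
        std::unique_lock<std::mutex> lock(m_);
        not_full_.wait(lock, [this] { return q_.size() < max_size_; });
        q_.push(std::move(item));
        not_empty_.notify_one();
    }

    T pop() {
        std::unique_lock<std::mutex> lock(m_);
        not_empty_.wait(lock, [this] { return !q_.empty(); });
        T item = std::move(q_.front());
        q_.pop();
        not_full_.notify_one();
        return item;
    }

private:
    std::queue<T> q_;
    size_t max_size_;
    std::mutex m_;
    std::condition_variable not_empty_, not_full_;
};
```

The read thread would push a package holding a pair of images, the processing thread would pop and process it, and an empty or sentinel package would signal the end of the image list.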
Upvotes: 2