ameerosein

Reputation: 563

What is the optimal multithreading scenario for processing the lines of a long file?

I have a big file and I want to read and process all of its lines (specifically the even lines) with multiple threads.

One suggestion is to read the whole file, break it into multiple files (as many as there are threads), and then let every thread process one of those files. Since this reads the whole file, writes it again, and then reads the pieces back, it seems slow (3x the I/O), and I think there must be better scenarios.

I thought of what might be a better scenario myself:

One thread reads the file and puts the data in a global variable, and the other threads read the data from that variable and process it. In more detail:

One thread runs function func1, reading the main file and pushing each even line onto a buffer, line1Buffer, of maximum size MAX_BUFFER_SIZE; the other threads run function func2, popping their data from the buffer and processing it. In code:

Global variables:

#include <fstream>
#include <string>
#include <vector>
using namespace std;
#define MAX_BUFFER_SIZE 100
vector<string> line1Buffer;
bool keepReading = true; // "continue" is a C++ keyword, so renamed; set to false to stop threads 2..N
long long numberOfReads; // number of reads in the file (2 lines each), assumed known in advance
string file = "reads.fq";

Function func1 : (thread 1)

void func1(){
 ifstream ifstr(file.c_str());
 string ReadSeq;
 for (long long i = 0; i < numberOfReads; i++) { // 2 lines per read
    getline(ifstr,ReadSeq); // skip the odd line
    getline(ifstr,ReadSeq); // keep the even line
    while( line1Buffer.size() == MAX_BUFFER_SIZE )
        ; // busy-wait while the buffer is full
    line1Buffer.push_back(ReadSeq);
 }
 keepReading = false;
 return;
}

And function func2 : (other threads)

void func2(){
 string ReadSeq;
 while( keepReading || !line1Buffer.empty() ){ // keep draining after the reader finishes
    if( !line1Buffer.empty() ){ // note: this shared access is unsynchronized
        ReadSeq = line1Buffer.back(); // vector::pop_back() returns void,
        line1Buffer.pop_back();       // so read the element before removing it
        // do the processing....
    }
 }
}

About the speed:

If the reading part is slower, the total time will be roughly the time to read the file once (the buffer will rarely hold more than one line at a time, so only one other thread will be able to work alongside thread 1). If the processing part is slower, the total time will be roughly the time for the whole processing with numberOfThreads - 1 threads. Both cases are faster than reading the file and writing it into multiple files with one thread, and then reading and processing those files with multiple threads...
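For example (hypothetical numbers, not from any measurement): if reading the file takes 10 s and processing it with a single thread takes 60 s, then with 4 threads (1 reader + 3 workers) the processing dominates and the total is roughly 60/3 = 20 s; if single-threaded processing took only 6 s instead, reading would dominate and the total would stay near 10 s.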

So there are two questions:

1- How can I call the functions from threads such that thread 1 runs func1 and the others run func2? (A sketch follows after the edit note below.)

2- is there any faster scenario?

3- [Deleted] Can anyone extend this idea to M threads for reading and N threads for processing? Obviously, M + N == numberOfThreads must hold.

Edit: the 3rd question is not right, as multiple threads can't help with reading a single file.
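For question 1, a minimal sketch using std::thread (NUM_THREADS is an assumed constant; func1 and func2 are the functions above):

#include <thread>
#include <vector>

const int NUM_THREADS = 4; // assumed total thread count

int main() {
    std::thread reader(func1);                  // thread 1 runs func1
    std::vector<std::thread> workers;
    for (int i = 0; i < NUM_THREADS - 1; i++)
        workers.emplace_back(func2);            // the other threads run func2
    reader.join();
    for (auto& w : workers)
        w.join();
    return 0;
}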

Thanks All

Upvotes: 1

Views: 217

Answers (4)

iutamir

Reputation: 11

Assume we have #p threads. Two scenarios were mentioned in the post and the answers:

1) Reading with one thread and processing with the other threads: in this case #p-1 threads process, compared with only one thread reading. Let jobTime be the time for the full operation and pTime(n) the time for processing with n threads. Then:

the worst case occurs when reading is much slower than processing, giving jobTime = pTime(1) + readTime; the best case is when processing is slower than reading, in which case jobTime equals pTime(#p-1) + readTime.

2) Read and process with all #p threads. In this scenario every thread performs two steps. The first step is to read a chunk of the file of size MAX_BUFFER_SIZE, which is sequential; no two threads can read at the same time. The second step is to process the data just read, which can be done in parallel. This way, in the worst case jobTime is pTime(1) + readTime as before (but*), while the best case is pTime(#p) + readTime, which is better than the previous scenario.

*: In the second approach's worst case, even though reading is slower, you can find an optimized MAX_BUFFER_SIZE at which (in the worst case) some reading by one thread overlaps with some processing by another thread. With this optimized MAX_BUFFER_SIZE, jobTime will be less than pTime(1) + readTime and can approach readTime.
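To put hypothetical numbers on this: with #p = 4 threads, readTime = 10 s, and pTime(n) = 60/n s, the first scenario gives at best pTime(3) + readTime = 20 + 10 = 30 s, while the second gives at best pTime(4) + readTime = 15 + 10 = 25 s.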

Upvotes: 1

Domso

Reputation: 970

Another approach could be interleaved threads: reading is done by every thread, but only one at a time. Because of the waiting in the very first iteration, the threads become interleaved.

But this only scales if work() is the bottleneck (otherwise a non-parallel execution would be better).

Thread:

while (!end) {
    // should be fair!
    lock();
    read();
    unlock();

    work();
}

A basic example (you should probably add some error handling):

#include <fstream>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

const int MAX_NUMBER_OF_LINES_PER_ITERATION = 100; // batch size per read turn
const int NUM_THREADS = 4;                         // worker thread count

void work(const std::string& line); // the per-line processing, defined elsewhere

void thread_exec(std::ifstream* file, std::mutex* mutex, int* global_line_counter) {
    std::string line;
    std::vector<std::string> data;
    int i;
    do {
        i = 0;
        // only 1 concurrent reader
        mutex->lock();
        // try to read the maximum number of lines
        while (i < MAX_NUMBER_OF_LINES_PER_ITERATION && std::getline(*file, line)) {
            // only the even lines we want to process
            if (*global_line_counter % 2 == 0) {
                data.push_back(line);
                i++;
            }
            (*global_line_counter)++;
        }
        mutex->unlock();

        // execute work for every line
        for (size_t j = 0; j < data.size(); j++) {
            work(data[j]);
        }

        // free old data
        data.clear();
    // until EOF is reached
    } while (i == MAX_NUMBER_OF_LINES_PER_ITERATION);
}

void process_data(std::string file) {
    // counter for checking if line is even
    int global_line_counter = 0;
    // open file
    std::ifstream ifstr(file.c_str());
    // mutex for synchronization
    // maybe a fair-lock would be a better solution
    std::mutex mutex;
    // create threads and start them with thread_exec(&ifstr, &mutex, &global_line_counter);
    std::vector<std::thread> threads(NUM_THREADS);
    for (int i = 0; i < NUM_THREADS; i++) {
        threads[i] = std::thread(thread_exec, &ifstr, &mutex, &global_line_counter);
    }
    // wait until all threads have finished
    for (int i = 0; i < NUM_THREADS; i++) {
        threads[i].join();
    }
}

Upvotes: 1

doron

Reputation: 28872

First off, reading a file is a slow operation, so unless you are doing some super-heavy processing, the file reading will be the limiting factor.

If you do decide to go the multithreaded route, a queue is the right approach. Just make sure you push at the front and pop at the back. A std::deque should work well. You will also need to lock the queue with a mutex and synchronize it with a condition variable.

One last thing: you will need to limit the size of the queue for the scenario where we push faster than we pop.
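A minimal sketch of such a bounded queue (the class name and capacity are illustrative, and a real version also needs a way to signal end-of-input to the consumers):

#include <condition_variable>
#include <deque>
#include <mutex>
#include <string>

class BoundedQueue {
    std::deque<std::string> queue_;
    std::mutex mutex_;
    std::condition_variable not_full_, not_empty_;
    const size_t max_size_;
public:
    explicit BoundedQueue(size_t max_size) : max_size_(max_size) {}

    void push(std::string line) { // called by the reader thread
        std::unique_lock<std::mutex> lock(mutex_);
        not_full_.wait(lock, [this] { return queue_.size() < max_size_; });
        queue_.push_front(std::move(line)); // push at the front...
        not_empty_.notify_one();
    }

    std::string pop() { // called by the worker threads
        std::unique_lock<std::mutex> lock(mutex_);
        not_empty_.wait(lock, [this] { return !queue_.empty(); });
        std::string line = std::move(queue_.back()); // ...pop at the back
        queue_.pop_back();
        not_full_.notify_one();
        return line;
    }
};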

Upvotes: 0

UKMonkey

Reputation: 6983

What is your bottleneck? Hard disk or processing time?

If it's the hard disk, then you're probably not going to get any more performance out of it, as you've hit the limits of the hardware. Sequential reads are by far faster than jumping around the file, so having multiple threads trying to read your file will almost certainly reduce the overall speed by increasing disk thrashing.

A single thread reading the file and a thread pool (or just one other thread) to deal with the contents is probably as good as you can get.

"Global variables:"

This is a bad habit to get into.

Upvotes: 1
