Reputation: 33303
I have a large text file. I want to read this file and perform some manipulation on it.
This manipulation occurs independently on each line, so basically I am looking for some function which can do this in parallel.
void readFile(string filename){
//do manipulation
}
That manipulation can happen in parallel.
Agreed that this can be done easily using Hadoop, but that is an overkill solution. (It's a large file, but not so large that I need Hadoop for it...)
How do I do this in C++?
Upvotes: 4
Views: 19469
Reputation:
I suggest you use something like fread to read many lines into a buffer and then operate on the buffer in parallel.
http://www.cplusplus.com/reference/cstdio/fread/
I once read an image one pixel (int) at a time, did a conversion on each pixel, and then wrote the value to a buffer. That took well over a minute for a large file. When I instead used fread to read the whole file into a buffer first and then did the conversion on the buffer in memory, the whole operation took less than a second. That's a huge improvement without using any parallelism.
Since your file is so large you can read it in in chunks, operate on each chunk in parallel, and then read in the next chunk. You could even read the next chunk (with one thread) while you're processing the previous chunk in parallel (with e.g. 7 threads), but you might find that's not even necessary. Personally, I would do the parallelism with OpenMP.
Edit: I forgot to mention that I gave an answer using fread to read in a file and process the lines in parallel with OpenMP:
openmp - while loop for text file reading and using a pipeline
It would probably be simple to modify that code to do what you want to do.
Upvotes: 3
Reputation: 13196
If I were faced with this problem and had to solve it, I'd just use a single-threaded approach; it's not worth putting too much effort into parallelism without speeding up the underlying medium.
But say you have the file on a ramdisk, or a really fast RAID, or the processing is somehow massively lopsided. In any such scenario, line processing now takes the majority of the time.
I'd structure my solution something like this:
class ThreadPool; // encapsulates a set of threads
class WorkUnitPool; // encapsulates a set of threadsafe work unit queues
class ReadableFile; // an interface to a file that can be read from
ThreadPool pool;
WorkUnitPool workunits;
ReadableFile file;
pool.Attach(workunits); // bind threads to (initially empty) work unit pool
file.Open("input.file");
while (!file.IsAtEOF()) workunits.Add(ReadLineFrom(file));
pool.Wait(); // wait for all of the threads to finish processing work units
My "solution" is a generic, high level design intended to provoke thinking of what tools you have available that you can adapt to your needs. You will have to think carefully in order to use this, which is what I want.
As with any threaded operation, be very careful to design it properly, otherwise you will run into race conditions, data corruption, and all manner of pain. If you can find a thread pool/work unit library that does this for you, by all means use that.
Upvotes: 4
Reputation: 5785
I would use mmap for that. mmap gives you memory-like access to the file, so you can easily read it in parallel. Please look at other Stack Overflow topics about mmap. Be careful when using a non-read-only pattern with mmap.
Upvotes: 8