frazman

Reputation: 33303

Reading a large text file in parallel in C++

I have a large text file. I want to read this file and perform some manipulation on it.

This manipulation is independent for each line, so basically I am looking for some way to do it in parallel:

void readFile(const std::string& filename){

  //do manipulation

}

The "do manipulation" part is what should happen in parallel.

Agreed, this could be done easily with Hadoop, but that would be overkill. (It's a large file, but not so large that I need Hadoop for it...)

How do I do this in C++?

Upvotes: 4

Views: 19469

Answers (3)

user2088790

Reputation:

I suggest you use something like fread to read many lines into a buffer and then operate on the buffer in parallel.

http://www.cplusplus.com/reference/cstdio/fread/

I once read an image one pixel (int) at a time, converted each pixel, and then wrote the value to a buffer. That took well over a minute for a large file. When I instead used fread to read the whole file into a buffer first and then did the conversion on the buffer in memory, it took less than one second for the whole operation. That's a huge improvement without using any parallelism.

Since your file is so large, you can read it in chunks, operate on each chunk in parallel, and then read in the next chunk. You could even read the next chunk (with one thread) while you're processing the previous chunk in parallel (with e.g. 7 threads), but you might find that's not even necessary. Personally, I would do the parallelism with OpenMP.

Edit: I forgot to mention that I gave an answer showing how to use fread to read in a file and process the lines in parallel with OpenMP: openmp - while loop for text file reading and using a pipeline. It would probably be simple to modify that code to do what you want.
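To make the idea concrete, here is a minimal sketch of the fread-plus-OpenMP approach, assuming C++11 and OpenMP (compile with e.g. -fopenmp); the file name and process_line() are placeholders for your input and per-line work:

#include <cstdio>
#include <string>
#include <vector>

// Placeholder for the per-line manipulation.
void process_line(const std::string& line) { (void)line; }

int main() {
    std::FILE* f = std::fopen("input.txt", "rb");
    if (!f) return 1;

    // One big fread instead of many small reads.
    std::fseek(f, 0, SEEK_END);
    long size = std::ftell(f);
    std::fseek(f, 0, SEEK_SET);
    std::string buffer(size, '\0');
    size_t got = std::fread(&buffer[0], 1, size, f);
    std::fclose(f);
    buffer.resize(got);

    // Split the buffer into lines up front so they can be indexed in parallel.
    std::vector<std::string> lines;
    size_t start = 0;
    for (size_t i = 0; i < buffer.size(); ++i) {
        if (buffer[i] == '\n') {
            lines.push_back(buffer.substr(start, i - start));
            start = i + 1;
        }
    }
    if (start < buffer.size()) lines.push_back(buffer.substr(start));

    // Every iteration is independent, so a plain parallel for is enough.
    #pragma omp parallel for
    for (long i = 0; i < (long)lines.size(); ++i)
        process_line(lines[i]);
}

To process the file in chunks instead, wrap the fread and the parallel loop in an outer loop that reads a fixed-size buffer each pass.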

Upvotes: 3

Wug

Reputation: 13196

If I were faced with this problem and had to solve it, I'd just use a single-threaded approach; it's not worth putting much effort into parallelism unless you can also speed up the underlying medium.

Say, however, that you have the file on a ramdisk or a really fast RAID, or that the per-line processing is unusually heavy. In any of those scenarios, line processing, not I/O, takes the majority of the time.

I'd structure my solution something like this:

class ThreadPool; // encapsulates a set of threads
class WorkUnitPool; // encapsulates a set of threadsafe work unit queues
class ReadableFile; // an interface to a file that can be read from

ThreadPool pool;
WorkUnitPool workunits;
ReadableFile file;

pool.Attach(workunits); // bind threads to (initially empty) work unit pool

file.Open("input.file")
while (!file.IsAtEOF()) workunits.Add(ReadLineFrom(file));

pool.Wait(); // wait for all of the threads to finish processing work units

My "solution" is a generic, high level design intended to provoke thinking of what tools you have available that you can adapt to your needs. You will have to think carefully in order to use this, which is what I want.

As with any threaded operation, be very careful to design it properly, otherwise you will run into race conditions, data corruption, and all manner of pain. If you can find a thread pool/work unit library that does this for you, by all means use that.
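If you want a concrete starting point, here is a rough sketch of that design using C++11 threads, with a mutex-protected queue standing in for ThreadPool/WorkUnitPool; all names are placeholders, not a real library:

#include <condition_variable>
#include <fstream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

std::queue<std::string> work;   // the work unit queue
std::mutex m;
std::condition_variable cv;
bool done = false;              // set by the reader at EOF

// Placeholder for the per-line manipulation.
void process_line(const std::string& line) { (void)line; }

void worker() {
    for (;;) {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [] { return done || !work.empty(); });
        if (work.empty()) return;          // done and queue drained
        std::string line = std::move(work.front());
        work.pop();
        lock.unlock();
        process_line(line);                // heavy work happens outside the lock
    }
}

int main() {
    unsigned n = std::thread::hardware_concurrency();
    if (n == 0) n = 4;                     // fallback if the count is unknown

    std::vector<std::thread> pool;         // the "ThreadPool"
    for (unsigned i = 0; i < n; ++i) pool.emplace_back(worker);

    std::ifstream file("input.file");
    std::string line;
    while (std::getline(file, line)) {     // the reader fills the queue
        { std::lock_guard<std::mutex> lock(m); work.push(line); }
        cv.notify_one();
    }
    { std::lock_guard<std::mutex> lock(m); done = true; }
    cv.notify_all();

    for (auto& t : pool) t.join();         // the pool.Wait() equivalent
}

Pushing one line at a time keeps the sketch simple; batching several lines per work unit would cut down on lock contention.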

Upvotes: 4

spinus

Reputation: 5785

I would use mmap for that. mmap gives you memory-like access to the file, so you can easily read it in parallel. Please look at other Stack Overflow topics about mmap, and be careful when using mmap with anything other than a read-only access pattern.
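A minimal POSIX sketch of that approach (Linux/Unix only; the file name is a placeholder and most error handling is omitted):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main() {
    int fd = open("input.txt", O_RDONLY);
    if (fd < 0) return 1;
    struct stat st;
    fstat(fd, &st);

    // Map the file read-only; it now looks like one big const char array.
    void* p = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) return 1;
    const char* data = static_cast<const char*>(p);
    (void)data;

    // Each thread can take a byte range of [data, data + st.st_size),
    // move its start and end forward to the nearest '\n', and process its
    // lines with no locking at all, because the mapping is never written to.

    munmap(p, st.st_size);
    close(fd);
}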

Upvotes: 8
