Parsing large file with MPI in C++

Question

I have a C++ program in which I want to parse a huge file, looking for some regex that I've implemented. The program was working ok when executed sequentially but then I wanted to run it using MPI.

I started the adaptation to MPI by differentiating the master (the one who coordinates the execution) from the workers (the ones that parse the file in parallel) in the main function. Something like this:

MPI::Init(argc, argv);
...

if(rank == 0) {
    ...

    // Master sends initial and ending byte to every worker
    for(int i = 1; i < total_workers; i++) {
        array[0] = (i-1) * first_worker_file_part;
        array[1] = i * first_worker_file_part;
        MPI::COMM_WORLD.Send(array, 2, MPI::INT, i, 1);
    }
}

if(rank != 0)
    readDocument();

...

MPI::Finalize();

The master will send to every worker an array with 2 position that contains the byte where it will start the reading of the file in position 0 and the byte where it needs to stop reading in position 1.

The readDocument() function looks like this by now (not parsing, just each worker reading his part of the file):

void readDocument()
{
    array = new int[2];
    MPI::COMM_WORLD.Recv(array, 10, MPI::INT, 0, 1, status);
    int read_length = array[1] - array[0];
    char* buffer = new char [read_length];

    if (infile)
    {
        infile.seekg(array[0]); // Start reading in supposed byte
        infile.read(buffer, read_length);
    }
}

I've tried different examples, from writing to a file the output of the reading to running it with different number of processes. What happens is that when I run the program with 20 processes instead of 10, for example, it lasts twice the time to read the file. I expected it to be nearly half the time and I can't figure why this is happening.

Also, in a different matter, I want to make the master wait for all the workers to complete their execution and then print the final time. Is there any way to "block" him while the workers are processing? Like a cond_wait in C pthreads?

High Performance Mark · Accepted Answer

In my experience people working on computer systems with parallel file systems tend to know about those parallel file systems so your question marks you out, initially, as someone not working on such a system.

Without specific hardware support reading from a single file boils down to the system positioning a single read head and reading a sequence of bytes from the disk to memory. This situation is not materially altered by the complex realities of many modern file systems, such as RAID, which may in fact store a file across multiple disks. When multiple processes ask the operating system for access to files at the same time the o/s parcels out disk access according to some notion, possibly of fairness, so that no process gets starved. At worst the o/s spends so much time switching disk access from process to process that the rate of reading drops significantly. The most efficient, in terms of throughput, approach is for a single process to read an entire file in one go while other processes do other things.

This situation, multiple processes contending for scarce disk i/o resources, applies whether or not those processes are part of a parallel, MPI (or similar) program or entirely separate programs running concurrently.

The impact is what you observe -- instead of 10 processes each waiting to get their own 1/10th share of the file you have 20 processes each waiting for their 1/20th share. Oh, you cry, but each process is only reading half as much data so the whole gang should take the same amount of time to get the file. No, I respond, you've forgotten to add the time it takes the o/s to position and reposition the read/write heads between accesses. Read time comprises latency (how long does it take reading to start once the request has been made) and throughput (how fast can the i/o system pass the bytes to and fro).

It should be easy to come up with some reasonable estimates of latency and bandwidth that explains the twice as long reading by 20 processes as by 10.

How can you solve this ? You can't, not without a parallel file system. But you might find that having the master process read the whole file and then parcel it out to be faster than your current approach. You might not, you might just find that the current approach is the fastest for your whole computation. If read time is, say, 10% of total computation time you might decide it's a reasonable overhead to live with.

Parsing large file with MPI in C++

Answers (2)

Related Questions