Gopal

Reputation: 785

Parallel Reading/Writing of a File in C

The problem is to read a file of about 20 GB simultaneously with n processes. The file contains one string per line, and the lengths of the strings may vary. A string is at most 10 bytes long.

I have a cluster of 16 nodes. Each node is a uni-processor with 6 GB of RAM. I am using MPI to write the parallel code.

What is an efficient way to partition this big file so that all resources are utilized?

Note: The constraint on the partitions is that each process must read the file as a chunk of a fixed number of lines. Suppose the file contains 1600 lines (i.e. 1600 strings). Then the first process should read lines 1 to 100, the second process lines 101 to 200, and so on.

As I understand it, a file can't be read by more than one process at a time, because there is only one file handle, which points at only one position. How, then, can the other processes read in parallel from different chunks?
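(For what it's worth, the premise here is a misconception: a file position belongs to each open handle, not to the file itself, so every process can open the same path independently and seek to its own chunk. A minimal single-machine sketch; the helper name and the file name `demo_lines.txt` are made up for illustration:)

```c
#include <stdio.h>
#include <string.h>

/* Open the same path twice; each FILE* keeps its own independent
 * offset, so two "readers" can sit at different lines at once. */
int independent_positions(const char *path)
{
    FILE *w = fopen(path, "w");
    if (!w)
        return -1;
    fputs("first\nsecond\n", w);
    fclose(w);

    FILE *a = fopen(path, "r");
    FILE *b = fopen(path, "r");
    fseek(b, 6, SEEK_SET);          /* b starts at "second" */

    char la[16], lb[16];
    fgets(la, sizeof la, a);        /* reads "first\n"  */
    fgets(lb, sizeof lb, b);        /* reads "second\n" */

    int ok = strcmp(la, "first\n") == 0 && strcmp(lb, "second\n") == 0;
    fclose(a);
    fclose(b);
    remove(path);
    return ok;
}
```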

Upvotes: 3

Views: 5611

Answers (4)

Martlark

Reputation: 14591

Here is a function in Python that uses MPI (via the pypar extension) to count the lines in a big file, splitting the duties amongst a number of hosts.

def getFileLineCount( file1 ):
    """
    uses pypar and mpi to speed up counting lines
    parameters:
        file1 - the file name to count lines
    returns:
        (line count)
    """
    import logging, mmap, os, pypar

    p1 = open( file1, "r" )
    f1 = mmap.mmap( p1.fileno(), 0, None, mmap.ACCESS_READ )

    # work out the file size
    fSize = os.stat( file1 ).st_size
    # divide up to farm out line counting (integer division)
    chunk = fSize // pypar.size() + 1

    lines = 0
    # set start and end locations for this rank
    seekStart = chunk * pypar.rank()
    seekEnd = min( chunk * ( pypar.rank() + 1 ), fSize )

    # find the start of the next line after the chunk boundary
    if pypar.rank() > 0:
        f1.seek( seekStart )
        f1.readline()
        seekStart = f1.tell()
        # tell the previous rank my seek start to make their seek end
        logging.info( 'Sending to %d, seek start %d' % ( pypar.rank() - 1, seekStart ) )
        pypar.send( seekStart, pypar.rank() - 1 )
    if pypar.rank() < pypar.size() - 1:
        seekEnd = pypar.receive( pypar.rank() + 1 )
        logging.info( 'Receiving from %d, seek end %d' % ( pypar.rank() + 1, seekEnd ) )

    f1.seek( seekStart )

    logging.info( 'Counting lines from file byte %d to %d' % ( seekStart, seekEnd ) )

    l1 = f1.readline()
    while len( l1 ) > 0:
        lines += 1
        l1 = f1.readline()
        if f1.tell() > seekEnd or len( l1 ) == 0:
            break
    f1.close()
    p1.close()

    # rank 0 gathers the partial counts from the other ranks
    if pypar.rank() == 0:
        logging.info( 'Receiving line info' )
        for p in range( 1, pypar.size() ):
            lines += pypar.receive( p )
    else:
        logging.info( 'Sending my line info' )
        pypar.send( lines, 0 )

    lines = pypar.broadcast( lines )
    return lines

Upvotes: 0

Jonathan Dursi

Reputation: 50947

So as you're discovering, text file formats are poor for dealing with large amounts of data; not only are they larger than binary formats, but you run into formatting problems like the one here (searching for newlines), and everything is much slower (data must be converted to and from strings). There can easily be a 10x difference in IO speed between text-based and binary formats for numerical data. But we'll assume for now that you're stuck with the text file format.

Presumably, you're doing this partitioning for speed. But unless you have a parallel filesystem -- that is, multiple servers serving from multiple disks, and a FS that can keep those coordinated -- it's unlikely you're going to get a significant speedup from having multiple MPI tasks reading from the same file, as ultimately these requests are all going to get serialized anyway at the server/controller/disk level.

Further, reading in large blocks of data is going to be much faster than fseek()ing around and doing small reads looking for newlines.

So my suggestion would be to have one process (perhaps the last) read all the data in as few chunks as it can and send the relevant lines to each task (including, finally, itself). If you know how many lines the file has at the start, this is fairly simple: read in, say, 2 GB of data, search through memory for the end of the (N/P)-th line, send everything up to that point to task 0 along with a "completed your data" message, and continue.
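The in-memory search for the end of the (N/P)-th line can be sketched as follows; the helper name is made up, and in the real program the buffer would be a large block just read from the file:

```c
#include <stddef.h>

/* Return the offset just past the end of the k-th line (1-based) in
 * buf, or len if the buffer holds fewer than k newlines. The reader
 * would send buf[0 .. result) to the target task, then repeat from
 * result for the next task. */
size_t offset_after_kth_line(const char *buf, size_t len, size_t k)
{
    size_t seen = 0;
    for (size_t i = 0; i < len; i++) {
        if (buf[i] == '\n' && ++seen == k)
            return i + 1;
    }
    return len;
}
```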

Upvotes: 4

Gangnus

Reputation: 24484

I think it would be better to write a piece of code that determines the line lengths and distributes the lines to processes. That distributing function would work not with the strings themselves, but only with their lengths.

Finding an algorithm for the even distribution of sources of fixed size is not a problem.

After that, the distributing function tells the other processes which pieces they have to work on. Process 0 (the distributor) reads a line; it already knows that line no. 1 should be handled by process 1, and so on: process 0 reads line no. N and knows which process has to work on it.

Oh! We needn't optimize the distribution from the start. The distributor process can simply read a new line from the input and give it to a free process. That's all.

So you have two solutions: a heavily optimized one and an easy one.

We could optimize even further by having the distributor process re-balance the not-yet-read strings from time to time.
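The length-based distribution above can be sketched without any MPI at all: a greedy assigner that hands each line to the process with the smallest byte load so far. All names here are illustrative, not from the answer:

```c
#include <stddef.h>

/* Assign each line (known only by its length) to the process with
 * the smallest total byte load so far. owner[i] receives the rank
 * chosen for line i; load[] accumulates bytes per process. */
void distribute_by_length(const size_t *line_len, size_t n_lines,
                          size_t *load, int n_procs, int *owner)
{
    for (int p = 0; p < n_procs; p++)
        load[p] = 0;
    for (size_t i = 0; i < n_lines; i++) {
        int best = 0;
        for (int p = 1; p < n_procs; p++)
            if (load[p] < load[best])
                best = p;            /* least-loaded process wins */
        owner[i] = best;
        load[best] += line_len[i];
    }
}
```

In the real program, process 0 would then send each process the (offset, length) pairs of its assigned lines rather than the strings themselves.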

Upvotes: 0

NPE

Reputation: 500913

You don't specify if there are any constraints on the partitions, so I'll assume there are none. I'll also assume that you want the partitions to be as close to equal in size as possible.

The naïve approach would be to split the file into chunks of size 20GB/n. The starting position of chunk i would be i*20GB/n for i=0..n-1.

The problem with that is, of course, that there's no guarantee that chunk boundaries would fall between the lines of the input file. In general, they won't.

Fortunately, there's an easy way to correct for this. Having established the boundaries as above, shift them slightly so that each of them (except i=0) is placed after the following newline.

That'll involve reading 15 small fragments of the file, but will result in a very even partition.
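The correction step can be sketched in C: seek to the naive offset and scan forward past the next newline. The helper name is made up for illustration:

```c
#include <stdio.h>

/* Given a naive starting offset, advance it to just past the next
 * '\n' so the partition starts on a line boundary. Offset 0 (the
 * i=0 chunk) is left unchanged. */
long align_to_next_line(FILE *f, long naive_start)
{
    if (naive_start == 0)
        return 0;
    fseek(f, naive_start, SEEK_SET);
    int c;
    while ((c = fgetc(f)) != EOF && c != '\n')
        ;                           /* skip the partial line */
    return ftell(f);
}
```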

In fact, the correction can be done by each node individually, but it's probably not worth complicating the explanation with that.

Upvotes: 0
