Concurrent non-overlapping pwrite() to a file mounted on NFS using multiple MPI processes

Question

I have a computational fluid dynamic code where I am coding a parallel read and write implementation. What I want to achieve is for multiple MPI processes to open the same file and write data to it (there is no overlap of data, I use pwrite() with offset information). This seems to be working fine when the two MPI processes are on the same computing node. However, when I use 2 or more computing nodes, some of the data does not reach the hard-drive. To demonstrate this, I have written the following C program which I compile using mpicc (my MPI distribution is MPICH):

#include 
#include 
#include 
#include 

long _numbering(long i,long j,long k, long N) {
  return (((i-1)*N+(j-1))*N+(k-1));
}

int main(int argc, char **argv)
{
  int   numranks, rank,fd,dd;
  long i,j,k,offset,N;
  double value=1.0;
  MPI_Init(NULL,NULL);
  MPI_Comm_size(MPI_COMM_WORLD, &numranks);
  MPI_Comm_rank(MPI_COMM_WORLD,&rank);
  N=10;
  offset=rank*N*N*N*sizeof(double);
  fd=-1;
  printf("Opening file datasetparallel.dat
");
  //while(fd==-1) {fd = open("datasetparallel.dat", O_RDWR | O_CREAT | O_SYNC,0666);}
  while(fd==-1) {fd = open("datasetparallel.dat", O_RDWR | O_CREAT,0666);}
  //while(dd==-1) {fd = open("/homeA/Desktop/", O_RDWR ,0666);}

  for(i=1;i<=N;i++) {
    for(j=1;j<=N;j++) {
      for(k=1;k<=N;k++) {
        if(pwrite(fd,&value,sizeof(double),_numbering(i,j,k,N)*sizeof(double)+offset)!=8) perror("datasetparallel.dat");
        //pwrite(fd,&value,sizeof(double),_numbering(i,j,k,N)*sizeof(double)+offset);
        value=value+1.0;
      }
    }
  }
  //if(close(fd)==-1) perror("datasetparallel.dat");
  fsync(fd); //fsync(dd);
  close(fd); //close(dd);
 
  printf("Done writing in parallel
");
  if(rank==0) {
    printf("Beginning serial write
");
    int ranknum;
    fd=-1;
    value=1.0;
    while(fd==-1) {fd = open("datasetserial.dat", O_RDWR | O_CREAT,0666);}
    for(ranknum=0;ranknum


The above program writes doubles in ascending sequence to a file. Each MPI process writes the same numbers(1.0 to 1000.0) but to different regions of the file. For example, rank 0 writes 1.0 to 1000.0, and rank 1 writes 1.0 to 1000.0 beginning from the location just after rank 0 wrote 1000.0 . The program outputs a file named datasetparallel.dat which has been written through concurrent pwrite()s. It also outputs datasetserial.dat for reference to compare with the datasetparallel.dat file to check its integrity (I do this by using the cmp command in the terminal). When a discrepancy is found using cmp, I check the contents of the files using the od command:
od -N  -tfD 

For example, I found some missing data (holes in the file) using the above program. In the parallelly written file, the output using od command :
.
.
.
0007660                      503                      504
0007700                      505                      506
0007720                      507                      508
0007740                      509                      510
0007760                      511                      512
0010000                        0                        0
*
0010620                        0
0010624

while in the reference file written in serial, the output from the od command:
.
.
.
0007760                      511                      512
0010000                      513                      514
0010020                      515                      516
0010040                      517                      518
0010060                      519                      520
0010100                      521                      522
0010120                      523                      524
0010140                      525                      526
0010160                      527                      528
0010200                      529                      530
0010220                      531                      532
0010240                      533                      534
0010260                      535                      536
0010300                      537                      538
0010320                      539                      540
0010340                      541                      542
0010360                      543                      544
0010400                      545                      546
0010420                      547                      548
0010440                      549                      550
0010460                      551                      552
0010500                      553                      554
0010520                      555                      556
0010540                      557                      558
0010560                      559                      560
0010600                      561                      562
0010620                      563                      564
.
.
.

So far, the only way to fix this seems to use the POSIX open() function with the O_SYNC flag, which ensures that file is written physically to the hard drive, but this seems to be impractically slow. Another equally slow approach seems to be using the inbuilt MPI I/O commands. I am not sure why MPI I/O is slow either. The storage has been mounted on NFS using the following flags: rw,nohide,insecure,no_subtree_check,sync,no_wdelay .I have tried calling fsync() on the file and the directory to no avail. Thus, I need advice on how to fix this.

Concurrent non-overlapping pwrite() to a file mounted on NFS using multiple MPI processes

Answers (1)

Related Questions