Reputation: 11
I have a computational fluid dynamics code in which I am implementing parallel reads and writes. I want multiple MPI processes to open the same file and write data to it (there is no overlap of data; I use pwrite() with offset information). This works fine when the MPI processes are on the same compute node. However, when I use 2 or more compute nodes, some of the data does not reach the hard drive. To demonstrate this, I wrote the following C program, which I compile with mpicc (my MPI distribution is MPICH):
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>

/* Map 1-based (i,j,k) indices to a linear element index. */
long _numbering(long i, long j, long k, long N) {
    return (((i - 1) * N + (j - 1)) * N + (k - 1));
}

int main(int argc, char **argv)
{
    int numranks, rank, fd, dd;
    long i, j, k, offset, N;
    double value = 1.0;

    MPI_Init(NULL, NULL);
    MPI_Comm_size(MPI_COMM_WORLD, &numranks);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    N = 10;
    offset = rank * N * N * N * sizeof(double);   /* each rank's byte offset into the file */

    fd = -1;
    printf("Opening file datasetparallel.dat\n");
    //while(fd==-1) {fd = open("datasetparallel.dat", O_RDWR | O_CREAT | O_SYNC, 0666);}
    while (fd == -1) { fd = open("datasetparallel.dat", O_RDWR | O_CREAT, 0666); }
    //while(dd==-1) {dd = open("/homeA/Desktop/", O_RDWR, 0666);}  /* attempt to fsync() the directory */

    /* Every rank writes the same sequence 1.0 .. N*N*N into its own region. */
    for (i = 1; i <= N; i++) {
        for (j = 1; j <= N; j++) {
            for (k = 1; k <= N; k++) {
                if (pwrite(fd, &value, sizeof(double),
                           _numbering(i, j, k, N) * sizeof(double) + offset) != sizeof(double))
                    perror("datasetparallel.dat");
                value = value + 1.0;
            }
        }
    }

    //if(close(fd)==-1) perror("datasetparallel.dat");
    fsync(fd);  //fsync(dd);
    close(fd);  //close(dd);
    printf("Done writing in parallel\n");

    if (rank == 0) {
        printf("Beginning serial write\n");
        int ranknum;
        fd = -1;
        value = 1.0;
        while (fd == -1) { fd = open("datasetserial.dat", O_RDWR | O_CREAT, 0666); }

        /* Rank 0 alone reproduces the whole file as a reference. */
        for (ranknum = 0; ranknum < numranks; ranknum++) {
            offset = ranknum * N * N * N * sizeof(double);
            printf("Offset for rank %d is %ld\n", ranknum, offset);
            printf("writing for rank=%d\n", ranknum);
            for (i = 1; i <= N; i++) {
                for (j = 1; j <= N; j++) {
                    for (k = 1; k <= N; k++) {
                        if (pwrite(fd, &value, sizeof(double),
                                   _numbering(i, j, k, N) * sizeof(double) + offset) != sizeof(double))
                            perror("datasetserial.dat");
                        value = value + 1.0;
                    }
                }
            }
            value = 1.0;
        }

        //if(close(fd)==-1) perror("datasetserial.dat");
        fsync(fd);
        close(fd);
        printf("Done writing in serial\n");
    }

    MPI_Finalize();
    return 0;
}
The above program writes doubles in ascending sequence to a file. Each MPI process writes the same numbers (1.0 to 1000.0) but to a different region of the file: for example, rank 0 writes 1.0 to 1000.0, and rank 1 writes 1.0 to 1000.0 starting at the location just after rank 0's last value. The program produces a file named datasetparallel.dat, written through concurrent pwrite()s, and a reference file datasetserial.dat written by rank 0 alone, which I compare against datasetparallel.dat to check its integrity (using the cmp command in the terminal). When cmp reports a discrepancy, I inspect the contents of the files with the od command:
od -N <byte_number> -tfD <file_name>
For example, I found missing data (holes in the file) with the above program. In the file written in parallel, the od command outputs:
.
.
.
0007660 503 504
0007700 505 506
0007720 507 508
0007740 509 510
0007760 511 512
0010000 0 0
*
0010620 0
0010624
while in the reference file written serially, the od command outputs:
.
.
.
0007760 511 512
0010000 513 514
0010020 515 516
0010040 517 518
0010060 519 520
0010100 521 522
0010120 523 524
0010140 525 526
0010160 527 528
0010200 529 530
0010220 531 532
0010240 533 534
0010260 535 536
0010300 537 538
0010320 539 540
0010340 541 542
0010360 543 544
0010400 545 546
0010420 547 548
0010440 549 550
0010460 551 552
0010500 553 554
0010520 555 556
0010540 557 558
0010560 559 560
0010600 561 562
0010620 563 564
.
.
.
So far, the only fix seems to be opening the file with the POSIX O_SYNC flag, which forces every write to be committed physically to the drive before returning, but this is impractically slow. Another, equally slow, approach seems to be using the built-in MPI I/O routines; I am not sure why MPI I/O is slow either. The storage is exported over NFS with the following options: rw,nohide,insecure,no_subtree_check,sync,no_wdelay. I have also tried calling fsync() on the file and on its directory, to no avail. I need advice on how to fix this.
Upvotes: 0
Views: 341
Reputation: 5223
NFS is a horrible file system. As you have seen, its caching behavior makes it trivially easy for processes to "false share" a cached block and then corrupt data.
If you are stuck with NFS, do the compute in parallel but then do all the I/O from one rank.
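A minimal sketch of that pattern, assuming each rank fills a contiguous local buffer with its block of values (the buffer and file names simply mirror the question and are illustrative):

/* Sketch: compute in parallel, but gather to rank 0 and write from rank 0 only. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, numranks;
    const int N = 10, count = N * N * N;
    double *local, *all = NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &numranks);

    /* Each rank fills its own block: 1.0 .. 1000.0, as in the question. */
    local = malloc(count * sizeof(double));
    for (int i = 0; i < count; i++) local[i] = (double)(i + 1);

    if (rank == 0) all = malloc((size_t)numranks * count * sizeof(double));

    /* Collect every rank's block on rank 0, in rank order. */
    MPI_Gather(local, count, MPI_DOUBLE, all, count, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* Only rank 0 ever touches the file, so NFS client caching on other
       nodes cannot leave holes in it. */
    if (rank == 0) {
        FILE *fp = fopen("datasetparallel.dat", "wb");
        fwrite(all, sizeof(double), (size_t)numranks * count, fp);
        fclose(fp);
        free(all);
    }

    free(local);
    MPI_Finalize();
    return 0;
}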
A true parallel file system like OrangeFS/PVFS (http://www.orangefs.org) will help immensely here, especially if you start using MPI-IO (you are already using MPI, so you're halfway there!). Lustre is another option. OrangeFS is the simpler of the two to configure, but I may be biased since I used to work on it.
It's absolutely possible to address arbitrary memory regions in collective I/O. All your data is MPI_DOUBLE, so at worst you need to describe the regions with MPI_TYPE_CREATE_HINDEXED and provide the addresses. You'll see a huge increase in performance, if for no other reason than that you will be issuing one MPI-IO call instead of (with N == 10) 1000. Your data is contiguous in the file, so you don't even have to worry about file views.
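Since the layout in the question is contiguous both in memory and in the file, each rank can replace its 1000 pwrite() calls with one collective write at its own offset. A minimal sketch under that assumption (the buffer setup is illustrative; MPI_TYPE_CREATE_HINDEXED would only be needed if the memory were noncontiguous):

/* Sketch: one collective MPI-IO call per rank instead of N*N*N pwrite()s. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank;
    const int N = 10, count = N * N * N;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double *buf = malloc(count * sizeof(double));
    for (int i = 0; i < count; i++) buf[i] = (double)(i + 1);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "datasetparallel.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Each rank writes its contiguous block at its own byte offset;
       the default file view (a linear stream of bytes) is sufficient. */
    MPI_Offset offset = (MPI_Offset)rank * count * sizeof(double);
    MPI_File_write_at_all(fh, offset, buf, count, MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(buf);
    MPI_Finalize();
    return 0;
}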
Furthermore, remember how I said "do all your I/O from one process"? This is a little more advanced, but if you set the "cb_nodes" hint (how many nodes to use for the "collective buffering" optimization) to 1, MPI-IO will do just that for you.
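A short sketch of that hint; it is meant to slot into the MPI_File_open of the previous example, with everything else unchanged:

/* Sketch: ask the MPI-IO layer to use a single collective-buffering
   aggregator, so the collective write is funneled through one node. */
MPI_Info info;
MPI_Info_create(&info);
MPI_Info_set(info, "cb_nodes", "1");

MPI_File fh;
MPI_File_open(MPI_COMM_WORLD, "datasetparallel.dat",
              MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
/* ... MPI_File_write_at_all(...) as in the previous sketch ... */
MPI_File_close(&fh);
MPI_Info_free(&info);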
Upvotes: 1