The following function computes the square matrix arrayrmsd, but the result differs depending on the value of export OMP_NUM_THREADS=x (x = number of cores used). The differences are small but significant and cannot be due to numerical approximations (IMHO).
arrayrmsd is a matrix of nmol x nmol elements. Note: arrayrmsd[i,j] == arrayrmsd[i * nmol + j].
What I wanted to do is run the optimize() function, which sits in the inner for loop, in parallel. This function compares the molecular structure mol1c with all the structures contained in arrcar[] and stores the numerical difference in arrayrmsd[i,j].
Because the first parameter of the optimize() function is modified, I need an array mol_omp[0..numthr] whose entries are each used exclusively by the assigned thread.
The idea here is to run in parallel only the for loop that contains the optimize() call (the code after #pragma omp for). Before the #pragma omp parallel region I initialized the corresponding mol_omp[i] for every thread.
I am new to OpenMP, and I cannot understand where my code is wrong:
void create_rmsd_matrix(double *arrayrmsd, int nmol, CRISTALLO *mol1c, CRISTALLO *arrcar, double tolerance) {
    int i, k, numthr;
    CRISTALLO *mol_omp;

    for (i = 0; i < nmol; i++)
        for (k = 0; k < nmol; k++)
            arrayrmsd[i * nmol + k] = 0.0;

    /* one scratch structure per thread */
    numthr = omp_get_max_threads();
    mol_omp = (CRISTALLO *) malloc(sizeof(*mol_omp) * numthr);
    for (i = 0; i < numthr; i++) {
        (mol_omp + i)->natoms = arrcar->natoms;
        (mol_omp + i)->num_symbols = arrcar->num_symbols;
        initmol(mol_omp + i);
    }

    #pragma omp parallel private(i,k) shared(nmol,arrayrmsd,arrcar,tolerance)
    {
        int tid = omp_get_thread_num();
        #pragma omp for private(i,k) schedule(static,1)
        for (i = 0; i < nmol; i++) {
            molcpy(mol_omp + tid, arrcar + i);
            printf("Processing struc N. %d by thread n %d\n", i + 1, tid);
            for (k = i + 1; k < nmol; k++) {
                arrayrmsd[i * nmol + k] = optimize(mol_omp + tid, arrcar + k, tolerance);
                printf("[ %5d , %5d ] = %.12lf (thread %d)\n", i + 1, k + 1, arrayrmsd[i * nmol + k], tid);
            }
        }
    }

    for (i = 0; i < numthr; i++) {
        free((mol_omp + i)->list_atomi);
        free((mol_omp + i)->num_atomi);
        free((mol_omp + i)->x);
        free((mol_omp + i)->y);
        free((mol_omp + i)->z);
    }
    free(mol_omp);
}
There is a similar question, but I cannot understand whether it applies to my case. The main difference here is that I have a dynamic array that every thread must use independently.
The problem probably resides in a data race, but I cannot figure out where, because all the elements of arrayrmsd can be computed independently.
Where am I wrong? Thank you!
I tried several #pragma omp parallel options, but honestly I lack knowledge of OpenMP.
The serial code and the parallel code run with OMP_NUM_THREADS=1 give identical results, but with OMP_NUM_THREADS>1 the content of the arrayrmsd matrix differs.
Edit: the arrcar array is not modified. molcpy(mol1, mol2) simply copies the molecular structure mol2 into mol1:
void molcpy(CRISTALLO *mol1, CRISTALLO *mol2) {
    int i;
    mol1->natoms = mol2->natoms;
    mol1->tipo_coordinate = mol2->tipo_coordinate;
    strcpy(mol1->comment, mol2->comment);
    for (i = 0; i < 3; i++) {
        mol1->v1[i] = mol2->v1[i];
        mol1->v2[i] = mol2->v2[i];
        mol1->v3[i] = mol2->v3[i];
    }
    mol1->scale = mol2->scale;
    mol1->num_symbols = mol2->num_symbols;
    for (i = 0; i < mol1->num_symbols; i++) strcpy(&mol1->list_atomi[i * 3], &mol2->list_atomi[i * 3]);
    for (i = 0; i < mol2->num_symbols; i++) mol1->num_atomi[i] = mol2->num_atomi[i];
    for (i = 0; i < mol2->natoms; i++) {
        mol1->x[i] = mol2->x[i];
        mol1->y[i] = mol2->y[i];
        mol1->z[i] = mol2->z[i];
    }
}
UPDATE:
For the readers: the code I posted above is correct, and the #pragma directives are correct. As suspected, I found a very subtle data-race bug in one of the many functions that optimize calls. Sorry about that, and thanks to everybody for the comments; they pushed me to find the solution. I'm sorry I cannot post all the code: it is huge.