Reputation: 1038
I'm trying to learn MPI by writing a program to calculate a Pearson coefficient. However, my program actually slows down after implementing MPI. Here is my code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#include <time.h>
#include <mpi.h>

#define aSize 2000000

double stan_dev_mpi(double stan_array[], double stan_mean){
    double a = 0;
    double atemp = 0;
    for (int i = 0; i < aSize; i++){
        a = a + pow((stan_array[i] - stan_mean), 2);
    }
    MPI_Allreduce(&a, &atemp, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    a = a / aSize;
    a = sqrt(a);
    return a;
}

double mean(double* mean_array){
    double mean = 0;
    for (int i = 0; i < aSize; i++){
        mean = mean + mean_array[i];
    }
    mean = mean / aSize;
    return mean;
}

int pearson_par(void){
    int comm_sz;
    int my_rank;
    double mean_a;
    double mean_b;

    MPI_Init(NULL, NULL);
    MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    double *a;
    a = malloc(sizeof(double) * aSize);
    double *b;
    b = malloc(sizeof(double) * aSize);
    for (int i = 0; i < aSize; i++){
        a[i] = sin(i);
        b[i] = sin(i + 2);
    }

    clock_t begin, end;
    double time_spent;
    begin = clock();

    double *buffera = (double *)malloc(sizeof(double) * (aSize / comm_sz));
    double *bufferb = (double *)malloc(sizeof(double) * (aSize / comm_sz));
    MPI_Scatter(a, aSize / comm_sz, MPI_DOUBLE, buffera, aSize / comm_sz, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Scatter(b, aSize / comm_sz, MPI_DOUBLE, bufferb, aSize / comm_sz, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    mean_a = mean(a);
    mean_b = mean(a);
    double stan_dev_a = stan_dev_mpi(a, mean_a);
    double stan_dev_b = stan_dev_mpi(b, mean_b);

    double pearson_numer;
    double pearson_numer_temp;
    for (int i = 0; i < aSize; i++){
        pearson_numer = pearson_numer + ((a[i] - mean_a) * (b[i] - mean_b));
    }
    MPI_Allreduce(&pearson_numer, &pearson_numer_temp, 1, MPI_DOUBLE, MPI_SUM,
                  MPI_COMM_WORLD);
    pearson_numer = pearson_numer / aSize;
    double pearson_coef = pearson_numer / (stan_dev_a * stan_dev_b);

    if (my_rank == 0){
        printf("%s %G\n", "The Pearson Coefficient is: ", pearson_coef);
    }

    end = clock();
    time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
    if (my_rank == 0){
        printf("%lf %s\n", time_spent, "sec");
    }

    MPI_Finalize();

    free(a);
    free(b);
    return 0;
}

int main(void) {
    pearson_par();
    return 0;
}
If I run it with 4 processes I get a run time of 0.06 s, compared to 0.03 s for the sequential version. I'm new to MPI, so I'm not sure what is causing the problem. Any help would be appreciated.
Upvotes: 1
Views: 464
Reputation: 9489
The main issue I see here is that you don't distribute your work, you replicate it across processes. So the more processes you use, the more work you do overall. The best-case scenario for your code would therefore be a flat time, irrespective of the number of MPI processes...
But since, in addition, your code does very little computation for a lot of memory accesses (very low arithmetic intensity), you are likely memory bound. Increasing the number of MPI processes (and hence the overall workload) increases the pressure on the memory bandwidth, which is a resource shared across cores and therefore across MPI processes. So instead of a flat time, what you experience is an increase in time...
If you want to have a chance of seeing any sort of speed-up, you'll have to actually distribute the work instead of replicating it. That translates into computing on your buffera and bufferb data rather than on a and b (well, on a and a here, which is yet another bug: mean_b is computed from a).
Upvotes: 2