Reputation: 130
I have a peculiar issue with MPI where an array that is not referenced in the MPI call is being overwritten; there is some sort of memory issue going on.
The first call to gatherv works as expected. On the second call to gatherv, data in the first array is affected!
The code I am working in is quite large, but I have created a standalone program that roughly shows the issue.
In the smaller program the problem is still present, but the program raises a segmentation fault instead of just continuing on the way the larger program does.
program main
use mpi
integer :: chunksize, send_count, i_start, i_end
integer, allocatable :: rec_starts(:), rec_counts(:)
integer, parameter :: dp = 8 ! double precision
REAL(DP), allocatable:: array_2d(:,:)
REAL(DP), allocatable:: array_3d(:,:,:)
INTEGER, parameter:: num_skill=5, num_pref=2
INTEGER, parameter:: num_ed=3, num_children=2, num_age=4, num_market=28, num_health=2, num_year=2
INTEGER, parameter:: num_total_state_m=num_children*num_market*num_year*num_ed*num_age*num_health*num_ed*num_age*num_health
real(dp), dimension(num_skill,num_total_state_m) :: array_2d_local
real(dp), dimension(num_pref,num_pref,num_total_state_m) :: array_3d_local
integer i,j,k,l,m
!mpi vars
integer :: ierr, ntasks, mpi_id
! Set up MPI
call mpi_init(ierr)
call mpi_comm_size(mpi_comm_world, ntasks, ierr) !get number of tasks
call mpi_comm_rank(mpi_comm_world, mpi_id, ierr) !get id of each task
write(*,*) 'process ', mpi_id+1, 'of ', ntasks, 'is alive,', ' mpi_id:',mpi_id
!calculate which 'i' this thread is responsible for
chunksize = (num_total_state_m + ntasks - 1) / ntasks !note int/int rounds down
i_start = (mpi_id)*chunksize + 1
i_end = min((mpi_id+1)*chunksize,num_total_state_m)
!set up practice matrices
allocate(array_2d(num_skill,num_total_state_m), &
         array_3d(num_pref,num_pref,num_total_state_m))
l = 1
m = -1
do i=1,num_skill
   do j=1, num_total_state_m
      if (mpi_id==0) array_2d_local(i,j) = l
      if (mpi_id==1) array_2d_local(i,j) = m
      l = l + 1
      m = m - 1
   end do
end do
l = 1
m = -1
do i=1, num_pref
   do j=1, num_pref
      do k=1, num_total_state_m
         if (mpi_id==0) array_3d_local(i,j,k) = l
         if (mpi_id==1) array_3d_local(i,j,k) = m
         l = l + 1
         m = m - 1
      end do
   end do
end do
! Next send matricies
allocate(rec_starts(ntasks), rec_counts(ntasks))
do i=1, ntasks
   rec_counts(i) = min(num_total_state_m, i * chunksize) - (i-1)*chunksize
   rec_starts(i) = (i-1) * chunksize
end do
rec_counts = rec_counts * num_skill
rec_starts = rec_starts * num_skill
send_count = rec_counts(mpi_id+1)
! -m (dimensions:num_skill, num_total_state_m) double
call mpi_gatherv(array_2d_local(:,i_start:i_end), send_count, &
                 mpi_double_precision, &
                 array_2d, rec_counts, rec_starts, mpi_double_precision, &
                 0, mpi_comm_world, ierr)
! Next do 3d array
! IF THESE LINES ARE UNCOMMENTED, THE PROGRAM WORKS FINE!
!do i=1, ntasks
!   rec_counts(i) = min(num_total_state_m, i * chunksize) - (i-1)*chunksize
!   rec_starts(i) = (i-1) * chunksize
!end do
rec_counts = rec_counts * num_pref
rec_starts = rec_starts * num_pref
send_count = rec_counts(mpi_id+1)
! -array_3d (num_pref,num_pref,num_total_state_m)double
print*, array_2d(1,1), mpi_id, 'before'
call mpi_gatherv(array_3d_local(:,:,i_start:i_end), send_count, &
                 mpi_double_precision, &
                 array_3d, rec_counts, rec_starts, mpi_double_precision, &
                 0, mpi_comm_world, ierr)
print*, array_2d(1,1), mpi_id, 'after'
deallocate(rec_starts, rec_counts)
deallocate(array_2d, array_3d)
end program main
Output from this smaller program looks like this:
mpifort -fcheck=all -fbacktrace -g -Og -ffree-line-length-2048 main.f90 -o run_main
mpiexec -np 2 run_main 2>&1 | tee run_main.log
process 1 of 2 is alive, mpi_id: 0
process 2 of 2 is alive, mpi_id: 1
1.0000000000000000 0 before
0.0000000000000000 1 before
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
Backtrace for this error:
#0 0x101e87579
#1 0x101e86945
#2 0x7fff6a9ecb5c
In the larger program, which does not segfault, the printed output looks something like this:
1.0000000000000000 0 before
0.0000000000000000 1 before
-1.9018063100806379 0 after
0.0000000000000000 1 after
I have been looking at other SO posts, e.g. MPI_Recv overwrites parts of memory it should not access,
but, as a non-expert in Fortran/MPI, I unfortunately find the replies to those posts are not quite enough for me to understand the issue.
Any assistance or insight is much appreciated. Thanks!
Edit: Thanks, just me being an idiot. If someone else comes across this, triple check your recvcounts and displs!
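For anyone who hits something similar: a cheap consistency check right before the call would have flagged the mismatch immediately. This is only a sketch, using the variable names from the program above and written for the 3-d gather:
! sketch of a pre-call sanity check (variable names from the program above)
if (send_count /= size(array_3d_local(:,:,i_start:i_end))) then
   print *, 'rank', mpi_id, ': send_count does not match the slice being sent'
end if
if (mpi_id == 0) then
   if (sum(rec_counts) /= size(array_3d)) then
      print *, 'root: sum(rec_counts) does not match size(array_3d)'
   end if
end if
With the counts as computed above, both checks trigger, because everything is still carrying the extra factor of num_skill.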
Upvotes: 0
Views: 119
Reputation: 8395
Your initial code does
do i=1, ntasks
   rec_counts(i) = min(num_total_state_m, i * chunksize) - (i-1)*chunksize
   rec_starts(i) = (i-1) * chunksize
end do
rec_counts = rec_counts * num_skill
rec_starts = rec_starts * num_skill
send_count = rec_counts(mpi_id+1)
and then
rec_counts = rec_counts * num_pref
rec_starts = rec_starts * num_pref
send_count = rec_counts(mpi_id+1)
You simply forgot to divide by num_skill.
A trivial fix is to replace the last three lines with:
rec_counts = rec_counts * num_pref / num_skill
rec_starts = rec_starts * num_pref / num_skill
send_count = rec_counts(mpi_id+1)
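Alternatively, you can rebuild the counts and displacements from scratch right before each gatherv instead of rescaling the previous values. A minimal sketch, where elems_per_col is just a placeholder introduced here for the number of elements per gathered column (num_skill for array_2d, num_pref*num_pref for array_3d):
! elems_per_col is a placeholder: elements per column of the gathered array
do i = 1, ntasks
   rec_counts(i) = (min(num_total_state_m, i*chunksize) - (i-1)*chunksize) * elems_per_col
   rec_starts(i) = (i-1) * chunksize * elems_per_col
end do
send_count = rec_counts(mpi_id+1)
That way every call starts from counts derived directly from the chunk sizes, and no factor from a previous call can leak through.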
If you ever suspect a bug in the MPI library, a good practice is to try another one (e.g. an MPICH derivative and Open MPI). If your application crashes with both, the odds are the bug is in your app.
Upvotes: 2