Please help me understand how I can improve sequential, unformatted I/O throughput with (G)Fortran, especially when working on NVMe SSDs.
I wrote a little test program (see the bottom of this post). It opens one or more files in parallel (OpenMP) and writes an array of random numbers into each. It then flushes the system caches (root required; otherwise the read test will most likely read from memory), reopens the files, and reads from them. Time is measured as wall time (trying to include only I/O-related time), and performance numbers are given in MiB/s. The program loops until aborted.
The hardware I am using for testing is a Samsung 970 Evo Plus 1TB SSD, connected via 2 PCIe 3.0 lanes. So in theory, it should be capable of ~1500MiB/s sequential reads and writes. Testing beforehand with "dd if=/dev/zero of=./testfile bs=1G count=1 oflag=direct" results in ~750MB/s. Not too great, but still better than what I get with Gfortran. And depending on who you ask, dd should not be used for benchmarking anyway. This is just to make sure that the hardware is in theory capable of more.
Results with my code tend to get better with larger file size, but even with 1GiB it caps out at around 200MiB/s write, 420MiB/s read. Using more threads (e.g. 4) increases write speeds a bit, but only to around 270MiB/s. I made sure to keep the benchmark runs short, and give the SSD time to relax between tests.
I was under the impression that it should be possible to saturate 2 PCIe 3.0 lanes worth of bandwidth, even with only a single thread. At least when using unformatted I/O.
The code does not seem to be CPU limited, top shows less than 50% usage on a single core if I move the allocation and initialization of the "values" field out of the loop. Which still does not bode well for overall performance, considering that I would like to see numbers that are at least 5 times higher.
I also tried to use access=stream for the open statements, but to no avail.
So what seems to be the problem?
Is my code wrong/unoptimized? Are my expectations too high?
Platform used:
Opensuse Leap 15.1, Kernel 4.12.14-lp151.28.36-default
2x AMD Epyc 7551, Supermicro H11DSI, Samsung 970 Evo Plus 1TB (2xPCIe 3.0)
gcc version 8.2.1, compiler options: -ffree-line-length-none -O3 -ffast-math -funroll-loops -flto
MODULE types
  implicit none
  save
  INTEGER, PARAMETER :: I8B = SELECTED_INT_KIND(18)
  INTEGER, PARAMETER :: I4B = SELECTED_INT_KIND(9)
  INTEGER, PARAMETER :: SP = KIND(1.0)
  INTEGER, PARAMETER :: DP = KIND(1.0d0)
END MODULE types

MODULE parameters
  use types
  implicit none
  save
  INTEGER(I4B) :: filesize   ! file size in MiB
  INTEGER(I4B) :: nthreads   ! number of threads for parallel execution
  INTEGER(I4B) :: alloc_size ! size of the allocated data field
END MODULE parameters

PROGRAM iometer
  use types
  use parameters
  use omp_lib
  implicit none
  CHARACTER(LEN=100) :: directory_char, filesize_char, nthreads_char
  CHARACTER(LEN=40) :: dummy_char1
  CHARACTER(LEN=110) :: filename
  CHARACTER(LEN=10) :: filenumber
  INTEGER(I4B) :: thread, tunit, n
  INTEGER(I8B) :: counti, countf, count_rate
  REAL(DP) :: telapsed_read, telapsed_write, mib_written, write_speed, mib_read, read_speed
  REAL(SP), DIMENSION(:), ALLOCATABLE :: values

  call system_clock(counti,count_rate)

  ! command line arguments: directory, file size in MiB, number of threads
  call getarg(1,directory_char)
  dummy_char1 = ' directory to test:'
  write(*,'(A40,A)') dummy_char1, trim(adjustl(directory_char))
  call getarg(2,filesize_char)
  dummy_char1 = ' file size (MiB):'
  read(filesize_char,*) filesize
  write(*,'(A40,I12)') dummy_char1, filesize
  call getarg(3,nthreads_char)
  dummy_char1 = ' number of parallel threads:'
  read(nthreads_char,*) nthreads
  write(*,'(A40,I12)') dummy_char1, nthreads

  alloc_size = filesize * 262144
  dummy_char1 = ' allocation size:'
  write(*,'(A40,I12)') dummy_char1, alloc_size
  mib_written = real(alloc_size,kind=dp) * real(nthreads,kind=dp) / 1048576.0_dp
  mib_read = mib_written

  CALL OMP_SET_NUM_THREADS(nthreads)
  do while(.true.)
    !$OMP PARALLEL default(shared) private(thread, filename, filenumber, values, tunit)
    thread = omp_get_thread_num()
    write(filenumber,'(I0.10)') thread
    filename = trim(adjustl(directory_char)) // '/' // trim(adjustl(filenumber)) // '.temp'
    allocate(values(alloc_size))
    call random_seed()
    call RANDOM_NUMBER(values)
    tunit = thread + 100

    ! write test
    !$OMP BARRIER
    !$OMP MASTER
    call system_clock(counti)
    !$OMP END MASTER
    !$OMP BARRIER
    open(unit=tunit, file=trim(adjustl(filename)), status='replace', action='write', form='unformatted')
    write(tunit) values
    close(unit=tunit)
    !$OMP BARRIER
    !$OMP MASTER
    call system_clock(countf)
    telapsed_write = real(countf-counti,kind=dp)/real(count_rate,kind=dp)
    write_speed = mib_written/telapsed_write
    !write(*,*) 'write speed (MiB/s): ', write_speed
    ! drop the page cache so the read test has to hit the disk (requires root)
    call execute_command_line ('echo 3 > /proc/sys/vm/drop_caches', wait=.true.)
    call system_clock(counti)
    !$OMP END MASTER
    !$OMP BARRIER

    ! read test
    open(unit=tunit, file=trim(adjustl(filename)), status='old', action='read', form='unformatted')
    read(tunit) values
    close(unit=tunit)
    !$OMP BARRIER
    !$OMP MASTER
    call system_clock(countf)
    telapsed_read = real(countf-counti,kind=dp)/real(count_rate,kind=dp)
    read_speed = mib_read/telapsed_read
    write(*,'(A29,2F10.3)') ' write / read speed (MiB/s): ', write_speed, read_speed
    !$OMP END MASTER
    !$OMP BARRIER
    deallocate(values)
    !$OMP END PARALLEL
    call sleep(1)
  end do
END PROGRAM iometer
The mistake in your code is that your calculation of mib_written does not take into account the size of a real(sp) variable (4 bytes). Thus your results are a factor of 4 too low. E.g. calculate it as
mib_written = filesize * nthreads
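To make the factor of 4 concrete, here is a small stand-alone sketch (my own illustration, not part of your program; the program name mib_check and the example values filesize = 1024 and nthreads = 1 are arbitrary assumptions). It compares the figure your formula reports with the number of MiB actually held in the array, assuming the default real kind occupies 4 bytes, as it does with gfortran unless options such as -fdefault-real-8 are used.

```fortran
program mib_check
  use iso_fortran_env, only: int64, real64
  implicit none
  ! Example inputs; both values are assumptions for illustration only.
  integer, parameter :: filesize = 1024   ! requested file size in MiB
  integer, parameter :: nthreads = 1
  integer(int64) :: alloc_size, bytes_per_value
  real(real64) :: mib_reported, mib_actual

  ! Number of default-real elements, exactly as in the test program.
  alloc_size = filesize * 262144_int64

  ! storage_size returns the size in bits; divide by 8 to get bytes.
  bytes_per_value = storage_size(1.0) / 8

  ! Original formula: counts elements, not bytes.
  mib_reported = real(alloc_size, real64) * nthreads / 1048576.0_real64

  ! Corrected formula: elements times bytes per element.
  mib_actual = real(alloc_size * bytes_per_value, real64) * nthreads / 1048576.0_real64

  print '(A,F10.2)', 'reported MiB: ', mib_reported
  print '(A,F10.2)', 'actual MiB:   ', mib_actual
  print '(A,F10.2)', 'ratio:        ', mib_actual / mib_reported
end program mib_check
```

With 4-byte reals the ratio printed on the last line should be 4, which is the amount by which the reported throughput is understated.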
Some minor nits, some specific to GFortran:
- Don't call random_seed, particularly not from each thread. If you want to call it, call it once at the beginning of the program.
- Use open(newunit=tunit, ...) to let the compiler runtime allocate a unique unit number for each file.
- For the standard 64-bit integer and floating-point kinds, use int64 and real64 from the iso_fortran_env intrinsic module.
- Declare alloc_size of kind int64 so that larger file sizes do not overflow it.
- Use the standard get_command_argument intrinsic instead of the nonstandard getarg.
- access='stream' is slightly faster than the default (sequential) as there's no need to handle the record length markers.

Your test program with these fixes (and the parameters module folded into the main program) below:
PROGRAM iometer
  use iso_fortran_env
  use omp_lib
  implicit none
  CHARACTER(LEN=100) :: directory_char, filesize_char, nthreads_char
  CHARACTER(LEN=40) :: dummy_char1
  CHARACTER(LEN=110) :: filename
  CHARACTER(LEN=10) :: filenumber
  INTEGER :: thread, tunit
  INTEGER(int64) :: counti, countf, count_rate
  REAL(real64) :: telapsed_read, telapsed_write, mib_written, write_speed, mib_read, read_speed
  REAL, DIMENSION(:), ALLOCATABLE :: values
  INTEGER :: filesize ! file size in MiB
  INTEGER :: nthreads ! number of threads for parallel execution
  INTEGER(int64) :: alloc_size ! size of the allocated data field

  call system_clock(counti,count_rate)

  ! command line arguments: directory, file size in MiB, number of threads
  call get_command_argument(1, directory_char)
  dummy_char1 = ' directory to test:'
  write(*,'(A40,A)') dummy_char1, trim(adjustl(directory_char))
  call get_command_argument(2, filesize_char)
  dummy_char1 = ' file size (MiB):'
  read(filesize_char,*) filesize
  write(*,'(A40,I12)') dummy_char1, filesize
  call get_command_argument(3, nthreads_char)
  dummy_char1 = ' number of parallel threads:'
  read(nthreads_char,*) nthreads
  write(*,'(A40,I12)') dummy_char1, nthreads

  ! number of 4-byte reals per file; 64-bit so large file sizes do not overflow
  alloc_size = filesize * 262144_int64
  dummy_char1 = ' allocation size:'
  write(*,'(A40,I12)') dummy_char1, alloc_size
  ! each thread writes filesize MiB
  mib_written = filesize * nthreads
  dummy_char1 = ' MiB written:'
  write(*, '(A40,g0)') dummy_char1, mib_written
  mib_read = mib_written

  CALL OMP_SET_NUM_THREADS(nthreads)
  !$OMP PARALLEL default(shared) private(thread, filename, filenumber, values, tunit)
  do while (.true.)
    thread = omp_get_thread_num()
    write(filenumber,'(I0.10)') thread
    filename = trim(adjustl(directory_char)) // '/' // trim(adjustl(filenumber)) // '.temp'
    ! allocate and fill the buffer only once per thread
    if (.not. allocated(values)) then
      allocate(values(alloc_size))
      call RANDOM_NUMBER(values)
    end if

    ! write test: the timer starts after all files have been opened
    open(newunit=tunit, file=filename, status='replace', action='write', form='unformatted', access='stream')
    !$omp barrier
    !$omp master
    call system_clock(counti)
    !$omp end master
    !$omp barrier
    write(tunit) values
    close(unit=tunit)
    !$omp barrier
    !$omp master
    call system_clock(countf)
    telapsed_write = real(countf - counti, kind=real64)/real(count_rate, kind=real64)
    write_speed = mib_written/telapsed_write
    ! drop the page cache so the read test has to hit the disk (requires root)
    call execute_command_line ('echo 3 > /proc/sys/vm/drop_caches', wait=.true.)
    !$OMP END MASTER

    ! read test: the timer starts after all files have been opened
    open(newunit=tunit, file=trim(adjustl(filename)), status='old', action='read', form='unformatted', access='stream')
    !$omp barrier
    !$omp master
    call system_clock(counti)
    !$omp end master
    !$omp barrier
    read(tunit) values
    close(unit=tunit)
    !$omp barrier
    !$omp master
    call system_clock(countf)
    telapsed_read = real(countf - counti, kind=real64)/real(count_rate, kind=real64)
    read_speed = mib_read/telapsed_read
    write(*,'(A29,2F10.3)') ' write / read speed (MiB/s): ', write_speed, read_speed
    !$OMP END MASTER
    call sleep(1)
  end do
  !$OMP END PARALLEL
END PROGRAM iometer
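
If you want to see the record length markers mentioned in the access='stream' nit, a minimal sketch along these lines may help. It is my own illustration, not part of the benchmark: the program name record_markers, the file names markers_seq.tmp and markers_stream.tmp, and the array length of 1000 are all made up. It writes the same array once with sequential and once with stream access, then compares the file sizes via inquire. With gfortran's default settings I would expect the sequential file to be a few bytes larger (one marker before and one after the record), but the exact overhead is compiler and option dependent.

```fortran
program record_markers
  use iso_fortran_env, only: int64
  implicit none
  integer, parameter :: n = 1000          ! arbitrary example length
  real :: values(n)
  integer :: u
  integer(int64) :: size_seq, size_stream

  call random_number(values)

  ! Sequential unformatted: the runtime wraps the record in length markers.
  open(newunit=u, file='markers_seq.tmp', status='replace', action='write', &
       form='unformatted', access='sequential')
  write(u) values
  close(u)

  ! Stream unformatted: only the raw bytes of the array are written.
  open(newunit=u, file='markers_stream.tmp', status='replace', action='write', &
       form='unformatted', access='stream')
  write(u) values
  close(u)

  ! Query the resulting file sizes in bytes.
  inquire(file='markers_seq.tmp', size=size_seq)
  inquire(file='markers_stream.tmp', size=size_stream)

  print '(A,I0)', 'sequential file size (bytes): ', size_seq
  print '(A,I0)', 'stream file size (bytes):     ', size_stream
  print '(A,I0)', 'record marker overhead:       ', size_seq - size_stream

  ! Remove the temporary files again.
  open(newunit=u, file='markers_seq.tmp', status='old', form='unformatted')
  close(u, status='delete')
  open(newunit=u, file='markers_stream.tmp', status='old', form='unformatted', &
       access='stream')
  close(u, status='delete')
end program record_markers
```

For a single large write per file the marker overhead is negligible in terms of bytes, so any speed difference between the two access modes in the benchmark should be small.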