MechEng

Reputation: 370

GFortran unformatted I/O throughput on NVMe SSDs

Please help me understand how I can improve sequential, unformatted I/O throughput with (G)Fortran, especially when working on NVMe SSDs.

I wrote a small test program (see the bottom of this post). It opens one or more files in parallel (OpenMP) and writes an array of random numbers into each. It then flushes the system caches (root required; otherwise the read test will most likely read from memory), reopens the files, and reads from them. Time is measured as wall time (trying to include only I/O-related time), and performance is reported in MiB/s. The program loops until aborted.

The hardware I am using for testing is a Samsung 970 Evo Plus 1TB SSD, connected via 2 PCIe 3.0 lanes. So in theory, it should be capable of ~1500MiB/s sequential reads and writes. Testing beforehand with "dd if=/dev/zero of=./testfile bs=1G count=1 oflag=direct" results in ~750MB/s. Not too great, but still better than what I get with Gfortran. And depending on who you ask, dd should not be used for benchmarking anyway. This is just to make sure that the hardware is in theory capable of more.

Results with my code tend to get better with larger file size, but even with 1GiB it caps out at around 200MiB/s write, 420MiB/s read. Using more threads (e.g. 4) increases write speeds a bit, but only to around 270MiB/s. I made sure to keep the benchmark runs short, and give the SSD time to relax between tests.

I was under the impression that it should be possible to saturate 2 PCIe 3.0 lanes worth of bandwidth even with a single thread, at least when using unformatted I/O. The code does not seem to be CPU-limited: top shows less than 50% usage on a single core if I move the allocation and initialization of the "values" array out of the loop. That still does not bode well for overall performance, considering that I would like to see numbers at least 5 times higher.
I also tried access=stream for the open statements, but to no avail.

So what seems to be the problem?
Is my code wrong/unoptimized? Are my expectations too high?

Platform used:
openSUSE Leap 15.1, Kernel 4.12.14-lp151.28.36-default
2x AMD Epyc 7551, Supermicro H11DSI, Samsung 970 Evo Plus 1TB (2xPCIe 3.0)
gcc version 8.2.1, compiler options: -ffree-line-length-none -O3 -ffast-math -funroll-loops -flto

MODULE types
    implicit none
    save

    INTEGER, PARAMETER  :: I8B = SELECTED_INT_KIND(18)
    INTEGER, PARAMETER  :: I4B = SELECTED_INT_KIND(9)
    INTEGER, PARAMETER  :: SP = KIND(1.0)
    INTEGER, PARAMETER  :: DP = KIND(1.0d0)

END MODULE types

MODULE parameters
    use types
    implicit none
    save

    INTEGER(I4B) :: filesize ! file size in MiB
    INTEGER(I4B) :: nthreads ! number of threads for parallel execution
    INTEGER(I4B) :: alloc_size ! size of the allocated data field

END MODULE parameters



PROGRAM iometer
    use types
    use parameters
    use omp_lib

    implicit none

    CHARACTER(LEN=100) :: directory_char, filesize_char, nthreads_char
    CHARACTER(LEN=40)  :: dummy_char1
    CHARACTER(LEN=110) :: filename
    CHARACTER(LEN=10)  :: filenumber
    INTEGER(I4B) :: thread, tunit, n
    INTEGER(I8B) :: counti, countf, count_rate
    REAL(DP) :: telapsed_read, telapsed_write, mib_written, write_speed, mib_read, read_speed
    REAL(SP), DIMENSION(:), ALLOCATABLE :: values

    call system_clock(counti,count_rate)

    call getarg(1,directory_char)
    dummy_char1 = ' directory to test:'
    write(*,'(A40,A)') dummy_char1, trim(adjustl(directory_char))

    call getarg(2,filesize_char)
    dummy_char1 = ' file size (MiB):'
    read(filesize_char,*) filesize
    write(*,'(A40,I12)') dummy_char1, filesize

    call getarg(3,nthreads_char)
    dummy_char1 = ' number of parallel threads:'
    read(nthreads_char,*) nthreads
    write(*,'(A40,I12)') dummy_char1, nthreads

    alloc_size = filesize * 262144

    dummy_char1 = ' allocation size:'
    write(*,'(A40,I12)') dummy_char1, alloc_size

    mib_written = real(alloc_size,kind=dp) * real(nthreads,kind=dp) / 1048576.0_dp
    mib_read = mib_written

    CALL OMP_SET_NUM_THREADS(nthreads)
    do while(.true.)
        !$OMP PARALLEL default(shared) private(thread, filename, filenumber, values, tunit)

        thread = omp_get_thread_num()
        write(filenumber,'(I0.10)') thread
        filename = trim(adjustl(directory_char)) // '/' // trim(adjustl(filenumber)) // '.temp'

        allocate(values(alloc_size))
        call random_seed()
        call RANDOM_NUMBER(values)
        tunit = thread + 100

        !$OMP BARRIER
        !$OMP MASTER
        call system_clock(counti)
        !$OMP END MASTER
        !$OMP BARRIER

        open(unit=tunit, file=trim(adjustl(filename)), status='replace', action='write', form='unformatted')
        write(tunit) values
        close(unit=tunit)

        !$OMP BARRIER
        !$OMP MASTER
        call system_clock(countf)
        telapsed_write = real(countf-counti,kind=dp)/real(count_rate,kind=dp)
        write_speed = mib_written/telapsed_write
        !write(*,*) 'write speed (MiB/s): ', write_speed
        call execute_command_line ('echo 3 > /proc/sys/vm/drop_caches', wait=.true.)
        call system_clock(counti)
        !$OMP END MASTER
        !$OMP BARRIER

        open(unit=tunit, file=trim(adjustl(filename)), status='old', action='read', form='unformatted')
        read(tunit) values
        close(unit=tunit)

        !$OMP BARRIER
        !$OMP MASTER
        call system_clock(countf)
        telapsed_read = real(countf-counti,kind=dp)/real(count_rate,kind=dp)
        read_speed = mib_read/telapsed_read
        write(*,'(A29,2F10.3)') ' write / read speed (MiB/s): ', write_speed, read_speed
        !$OMP END MASTER
        !$OMP BARRIER
        deallocate(values)
        !$OMP END PARALLEL

        call sleep(1)

    end do

END PROGRAM iometer

Upvotes: 1

Views: 188

Answers (1)

janneb

Reputation: 37188

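The mistake in your code is in the calculation of mib_written: it counts array elements rather than bytes, i.e. it does not take into account the size of a real(sp) variable (4 bytes). Your reported speeds are therefore a factor of 4 too low. Since alloc_size is already filesize * 262144 elements of 4 bytes each, every thread writes exactly filesize MiB, so you can calculate it as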

mib_written = filesize * nthreads
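
To see where the factor of 4 comes from, the original and the byte-based calculation can be compared side by side. This is only a minimal sketch: the values of filesize and nthreads are example inputs, the variable names mib_original, mib_fixed and bytes_per_value are made up for illustration, and it assumes the default real occupies 4 bytes (as it does with gfortran unless options such as -fdefault-real-8 are used).

```fortran
program mib_check
  use iso_fortran_env, only: int64, real64
  implicit none

  integer :: filesize, nthreads
  integer(int64) :: alloc_size, bytes_per_value
  real(real64) :: mib_original, mib_fixed

  filesize = 1024   ! example value: file size in MiB per thread
  nthreads = 4      ! example value: number of threads

  ! Number of default-real elements per file, as in the question.
  alloc_size = filesize * 262144_int64

  ! storage_size returns the size in bits; divide by 8 to get bytes.
  ! Assumption: this yields 4 for a default real.
  bytes_per_value = storage_size(1.0) / 8

  ! Formula from the question: counts elements, not bytes.
  mib_original = real(alloc_size, real64) * real(nthreads, real64) / 1048576.0_real64

  ! Byte-based formula: elements * bytes per element / bytes per MiB.
  mib_fixed = real(alloc_size * bytes_per_value, real64) * real(nthreads, real64) &
              / 1048576.0_real64

  print '(a,f10.1)', ' original formula (MiB):   ', mib_original
  print '(a,f10.1)', ' byte-based formula (MiB): ', mib_fixed

  ! With 4-byte reals the byte-based value equals filesize * nthreads.
  if (abs(mib_fixed - real(filesize * nthreads, real64)) > 1.0e-9_real64) then
     error stop 'byte-based MiB does not match filesize * nthreads'
  end if
end program mib_check
```

With 4-byte reals the second number is four times the first, which is the same factor by which the reported speeds need to be scaled.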

Some minor nits, some specific to GFortran:

  • Don't repeatedly call random_seed, particularly not from each thread. If you want to call it, call it once in the beginning of the program.
  • You can use open(newunit=tunit, ...) to let the compiler runtime allocate a unique unit number for each file.
  • If you want the 'standard' 64-bit integer/floating point kinds, you can use the named constants int64 and real64 from the iso_fortran_env intrinsic module.
  • For testing with larger files, you need to declare alloc_size with kind int64; a default (32-bit) integer overflows once the element count exceeds 2**31 - 1.
  • Use the standard get_command_argument intrinsic instead of the nonstandard getarg.
  • access='stream' is slightly faster than the default (sequential) as there's no need to handle the record length markers.
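
The newunit and access='stream' points can be tried in isolation. The sketch below is only an illustration and not part of the benchmark: the file name marker_test.tmp and the array size n are arbitrary choices. It writes the same array once with stream access and once with the default sequential access, then uses inquire to compare the resulting file sizes.

```fortran
program stream_vs_sequential
  use iso_fortran_env, only: int64
  implicit none

  integer, parameter :: n = 1024                             ! arbitrary small test size
  character(len=*), parameter :: fname = 'marker_test.tmp'   ! hypothetical file name
  real :: values(n)
  integer :: u
  integer(int64) :: size_stream, size_seq

  ! No call to random_seed here; the values only need to be arbitrary data.
  call random_number(values)

  ! Stream access: only the raw array data is written.
  open(newunit=u, file=fname, status='replace', action='write', &
       form='unformatted', access='stream')
  write(u) values
  close(u)
  inquire(file=fname, size=size_stream)

  ! Sequential access: the runtime frames the record with length markers.
  open(newunit=u, file=fname, status='replace', action='write', &
       form='unformatted', access='sequential')
  write(u) values
  close(u)
  inquire(file=fname, size=size_seq)

  print '(a,i0)', ' stream file size (bytes):     ', size_stream
  print '(a,i0)', ' sequential file size (bytes): ', size_seq
  print '(a,i0)', ' marker overhead (bytes):      ', size_seq - size_stream

  ! Clean up the temporary file.
  open(newunit=u, file=fname, status='old')
  close(u, status='delete')
end program stream_vs_sequential
```

With gfortran's default settings I would expect the overhead to be 8 bytes (a 4-byte marker before and after the record), but the record layout is implementation-specific, so treat that number as an assumption. For a single large record the extra bytes are negligible, which fits the speed difference being only slight.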

Your test program with these fixes (and the parameters module folded into the main program) below:

PROGRAM iometer
  use iso_fortran_env
  use omp_lib

  implicit none

  CHARACTER(LEN=100) :: directory_char, filesize_char, nthreads_char
  CHARACTER(LEN=40)  :: dummy_char1
  CHARACTER(LEN=110) :: filename
  CHARACTER(LEN=10)  :: filenumber
  INTEGER :: thread, tunit
  INTEGER(int64) :: counti, countf, count_rate
  REAL(real64) :: telapsed_read, telapsed_write, mib_written, write_speed, mib_read, read_speed
  REAL, DIMENSION(:), ALLOCATABLE :: values

  INTEGER :: filesize ! file size in MiB
  INTEGER :: nthreads ! number of threads for parallel execution
  INTEGER(int64) :: alloc_size ! size of the allocated data field


  call system_clock(counti,count_rate)

  call get_command_argument(1, directory_char)
  dummy_char1 = ' directory to test:'
  write(*,'(A40,A)') dummy_char1, trim(adjustl(directory_char))

  call get_command_argument(2, filesize_char)
  dummy_char1 = ' file size (MiB):'
  read(filesize_char,*) filesize
  write(*,'(A40,I12)') dummy_char1, filesize

  call get_command_argument(3, nthreads_char)
  dummy_char1 = ' number of parallel threads:'
  read(nthreads_char,*) nthreads
  write(*,'(A40,I12)') dummy_char1, nthreads

  alloc_size = filesize * 262144_int64

  dummy_char1 = ' allocation size:'
  write(*,'(A40,I12)') dummy_char1, alloc_size

  mib_written = filesize * nthreads
  dummy_char1 = ' MiB written:'
  write(*, '(A40,g0)') dummy_char1, mib_written
  mib_read = mib_written

  CALL OMP_SET_NUM_THREADS(nthreads)
  !$OMP PARALLEL default(shared) private(thread, filename, filenumber, values, tunit)
  do while (.true.)
     thread = omp_get_thread_num()
     write(filenumber,'(I0.10)') thread
     filename = trim(adjustl(directory_char)) // '/' // trim(adjustl(filenumber)) // '.temp'

     if (.not. allocated(values)) then
        allocate(values(alloc_size))
        call RANDOM_NUMBER(values)
     end if

     open(newunit=tunit, file=filename, status='replace', action='write', form='unformatted', access='stream')
     !$omp barrier
     !$omp master
     call system_clock(counti)
     !$omp end master
     !$omp barrier
     write(tunit) values
     close(unit=tunit)
     !$omp barrier
     !$omp master
     call system_clock(countf)

     telapsed_write = real(countf - counti, kind=real64)/real(count_rate, kind=real64)
     write_speed = mib_written/telapsed_write
     call execute_command_line ('echo 3 > /proc/sys/vm/drop_caches', wait=.true.)

     !$OMP END MASTER

     open(newunit=tunit, file=trim(adjustl(filename)), status='old', action='read', form='unformatted', access='stream')
     !$omp barrier
     !$omp master
     call system_clock(counti)
     !$omp end master
     !$omp barrier
     read(tunit) values
     close(unit=tunit)
     !$omp barrier
     !$omp master
     call system_clock(countf)

     telapsed_read = real(countf - counti, kind=real64)/real(count_rate, kind=real64)
     read_speed = mib_read/telapsed_read
     write(*,'(A29,2F10.3)') ' write / read speed (MiB/s): ', write_speed, read_speed
     !$OMP END MASTER

     call sleep(1)

  end do
  !$OMP END PARALLEL

END PROGRAM iometer

Upvotes: 2
