Pavel Ondračka
Pavel Ondračka

Reputation: 23

Parallelizing formated writes in Fortran with OpenMP

I'm trying to parallelize a Fortran code which at one moment writes a tons of numbers to formated output. Some simple profiling showed that most CPU time is spent in format conversion, so I had the idea to do the formatting in parallel to character buffers and later write the unformatted buffers to the file.

My proof of concept looks like this:

program parawrite
   implicit none

   integer (kind = 4) :: i, j, tstart, tstop, rate
   integer (kind = 4), parameter :: bufsize = 100000, n = 10000000, llen = 22
   character (kind=1, len=:), allocatable :: buf
   real (kind=8), dimension(n) :: a

! some input
   do i = 1, n
      a(i) = dble(i) * dble(i)
   enddo

! formated writes for reference
   open(unit=10, file="out1.txt", form="formatted")
   call system_clock(tstart, rate);
   do i = 1, n
      write(10,"(E21.15)") a(i)
   end do
   call system_clock(tstop, rate);
   print *, 'Formated write: ', dble(tstop - tstart) / dble(rate), 's'
   close(10)

! parallel stuff
   open(unit=10, file="out2.txt", access="stream", form="unformatted")
   call system_clock(tstart, rate);

!$omp parallel private(buf, j)
   allocate(character(bufsize * llen) :: buf)
   j = 0;
!$omp do ordered schedule(dynamic,bufsize)
   do i = 1, n
      write (buf(j*llen+1:(j+1)*llen),"(E21.15,A1)") a(i), char(10)
      j = j + 1
      if (mod(i, bufsize) == 0) then
!$omp ordered
         write (10) buf
!$omp end ordered
         j = 0
      end if
   end do
   deallocate(buf)
!$omp end parallel

   close(10)
   call system_clock(tstop, rate);
   print *, 'Parallel write: ', dble(tstop - tstart) / dble(rate), 's'

end program parawrite

When I run it, however, not only is the parallel version much slower when at single thread, it also doesn't scale too much...

$ gfortran -O2 -fopenmp writetest.f90

$ OMP_NUM_THREADS=1 ./a.out
Formated write:    11.330000000000000      s
Parallel write:    15.625999999999999      s

$ OMP_NUM_THREADS=6 ./a.out
Formated write:    11.331000000000000      s
Parallel write:    6.1799999999999997      s

My first question would be how to make it the same speed at single thread? The time spent writing the buffer to the file is negligible, so why are the writes to the buffer slower than when writing directly to file?

My second question is about why the scaling is so bad? I have an equivalent C code which uses sprintf and fwrite and there I can get almost perfect linear scaling (I can post the code if needed), however with Fortran I can only reduce runtime to around 40% at 6 threads (with C I can reduce it to 18% at the same number of threads). It is still faster than the serial version, but I hope this could be improved.

Upvotes: 2

Views: 1127

Answers (1)

roygvib
roygvib

Reputation: 7395

From some experiments, it seems that an internal file is rather slow if an array element is converted to an internal file one at a time. This is also the case for an external file, but the degree of slowdown seems much greater for internal files (for some reason...). So I've modified the code such that a set of array elements are converted at once and then written to an external file via stream output. Below, four patterns are compared:

  • Sequential (1): The original code (which writes each element via do-loop)
  • Sequential (2): Write an array at once (or via implied loop) to an external file
  • Parallel (1): Make an internal file for many elements and then write to an external file
  • Parallel (2): Simplest parallel code with formatted write or spirntf for each element

Among these, Parallel (2) + sprintf (marked with *2 in the code) was the fastest, while Parallel (2) + write for each element (marked with *1) was the slowest (timing shown as Parallel (*) in the table, which does not scale with OpenMP for some reason). I guess sprintf will be the fastest probably because of the least amount of internal checks and overhead etc (just a guess!)

Results (please see the bottom for the modified codes)

$ gcc -O3 -c conv.c && gfortran -O3 -fopenmp test.f90 conv.o

# Machine: Core i7-8550U (1.8GHz), 4-core/8-thread, Ubuntu18.04 (GCC7.3.0)

# Note: The amount of data has been reduced to 1/5 of the 
# original code, n = bufsize * 20, but the relative
# timing results remain the same even for larger data.

$ OMP_NUM_THREADS=1 ./a.out
 Sequential (1):   2.0080000000000000      s
 Sequential (2):   1.6510000000000000      s
 Parallel   (1):   1.6960000000000000      s
 Parallel   (2):   1.2640000000000000      s
 Parallel   (*):   3.1480000000000001      s

$ OMP_NUM_THREADS=2 ./a.out
 Sequential (1):   1.9990000000000001      s
 Sequential (2):   1.6479999999999999      s
 Parallel   (1):   0.98599999999999999     s
 Parallel   (2):   0.72999999999999998     s
 Parallel   (*):   1.8600000000000001      s   

$ OMP_NUM_THREADS=4 ./a.out
 Sequential (1):   2.0289999999999999      s
 Sequential (2):   1.6499999999999999      s
 Parallel   (1):   0.61199999999999999     s
 Parallel   (2):   0.49399999999999999     s
 Parallel   (*):   1.4470000000000001      s

$ OMP_NUM_THREADS=8 ./a.out
 Sequential (1):   2.0059999999999998      s
 Sequential (2):   1.6499999999999999      s
 Parallel   (1):   0.56200000000000006     s
 Parallel   (2):   0.41299999999999998     s
 Parallel   (*):   1.7689999999999999      s

main.f90:

program main
    implicit none
    integer :: i, j, k, tstart, tstop, rate, idiv, ind1, ind2
    integer, parameter :: bufsize = 100000, n = bufsize * 20, llen = 22, ndiv = 8
    character(len=:), allocatable :: buf(:), words(:)
    character(llen + 1) :: word
    real(8), allocatable :: a(:)

    allocate( a( n ) )

! Some input
    do i = 1, n
        a(i) = dble(i)**2
    enddo

!.........................................................
! Formatted writes (1).

    open(unit=10, file="dat_seq1.txt", form="formatted")
    call system_clock(tstart, rate);

    do i = 1, n
        write(10,"(ES21.15)") a(i)
    end do

    call system_clock(tstop, rate);
    print *, 'Sequential (1):', dble(tstop - tstart) / dble(rate), 's'
    close(10)

!.........................................................
! Formatted writes (2).

    open(unit=10, file="dat_seq2.txt", form="formatted")
    call system_clock(tstart, rate);

    write( 10, "(ES21.15)" ) a
!    write( 10, "(ES21.15)" ) ( a( k ), k = 1, n )

    call system_clock(tstop, rate);
    print *, 'Sequential (2):', dble(tstop - tstart) / dble(rate), 's'
    close(10)

!.........................................................
! Parallel writes (1): make a formatted string for many elements at once

    allocate( character( llen * bufsize / ndiv ) :: buf( ndiv ) )

    open(unit=10, file="dat_par1.txt", access="stream", form="unformatted")
    call system_clock(tstart, rate);

    do i = 1, n, bufsize

       !$omp parallel do private( idiv, ind1, ind2, k ) shared( i, buf, a )
        do idiv = 1, ndiv
            ind1 = i + (idiv - 1) * bufsize / ndiv
            ind2 = ind1 + bufsize / ndiv - 1

            write( buf( idiv ),"(*(ES21.15, A1))") &
                    ( a( k ), char(10), k = ind1, ind2 )
        enddo
        !$omp end parallel do

        write(10) buf
    end do

    call system_clock(tstop, rate);
    print *, 'Parallel   (1):', dble(tstop - tstart) / dble(rate), 's'
    deallocate(buf)
    close(10)

!.........................................................
! Parallel writes (2): sprintf vs write for each element

    allocate( character( llen ) :: words( n ) )

    open(unit=10, file="dat_par2.txt", access="stream", form="unformatted")
    call system_clock(tstart, rate);

    !$omp parallel do private( i, word ) shared( a, words )
    do i = 1, n
        ! write( word, "(ES21.15, A1)" ) a( i ), char(10)  !! slow (*1)
        call conv( word, a( i ) )  !! sprintf (*2)
        words( i ) = word( 1 : llen )
    enddo
    !$omp end parallel do

    write( 10 ) words

    call system_clock(tstop, rate);
    print *, 'Parallel   (2):', dble(tstop - tstart) / dble(rate), 's'
    close(10)

end program

conv.c:

#include <stdio.h>

void conv_( char *buf, double *val )
{
    sprintf( buf, "%21.15E\n", *val );
}

Upvotes: 1

Related Questions