builder-7000

Reputation: 7647

Reduce size of output files in FORTRAN

I want to minimize the size of output files in FORTRAN without losing any data. To find the best method for doing so I wrote the program:

      program test                                                              

      character(len=255) format

1     format(9i3)                                                               

c FORMATTED          
      open(99,file='form1.txt',form='formatted')                                
      do i=1,1                                                            
        write(99,1) 1, 2, 3, 4, 5, 6, 7, 8, 9                                   
      enddo                                                                     
      close(99)                                                                 

c UNFORMATTED          
      open(98,file='form2.txt',form='unformatted')                              
      do i=1,1                                                            
        write(98) 1, 2, 3, 4, 5, 6, 7, 8, 9                                     
      enddo                                                                     
      close(98)                                                                 

c DIRECT ACCESS          
      nrec=sizeof(i)*9                                                          
      open(97,file='form3.txt',form='unformatted',                              
     &     access='direct',recl=nrec)                                           
      do i=1,1                                                            
        write(97,rec=i) 1, 2, 3, 4, 5, 6, 7, 8, 9                               
      enddo                                                                     
      close(97)                                                                 

      call system('ls -lh form?.txt')                                           
      end

This will create three files with one record each. The output of this program is:

-rw-r--r--. 1 user users  28 May 27 17:10 form1.txt
-rw-r--r--. 1 user users  44 May 27 17:10 form2.txt
-rw-r--r--. 1 user users  36 May 27 17:10 form3.txt

From Oracle's website:

If FORM='UNFORMATTED', each record is preceded and terminated with an INTEGER*4 count, making each record 8 characters longer than normal. This convention is not shared with other languages, so it is useful only for communicating between FORTRAN programs.

My questions are:

  1. Why is there a difference of 16 bytes (not 8 bytes, as mentioned in the quote above) between form1.txt and form2.txt? Note that the size of form1.txt depends on the format (e.g. if I change the line format(9i3) to format(9i4), the size of form1.txt increases by 9 bytes).

and my main question is:

  2. I have big data files (larger than 100 GB) with five columns and millions of rows. What is the best method in FORTRAN to reduce the size of my output files (perhaps writing in binary form)?

A similar question to mine is: Best way to write a large array to file in fortran? Text vs Other

Upvotes: 2

Views: 1345

Answers (2)

chw21

Reputation: 8140

Basically, your format 9i3 means that every number takes up exactly 3 bytes in the file. That's 27 bytes, plus one for the newline character at the end of the record, which makes 28.

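But this format can only hold numbers up to 999 (or down to -99); anything that does not fit in the field is written as asterisks. Even within that range, numbers over 99 fill the whole field, so adjacent values run together with no separating space.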
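
To make the field-width behaviour concrete, here is a small sketch. It is not part of the question's program; it is written in free-form Fortran and writes into a character variable instead of a file, so the fields are easy to see.

```fortran
! Illustration only: how an I3 field behaves for different values.
program field_width
  implicit none
  character(len=9) :: line

  ! Every value fits with room to spare: fields are right-justified.
  write(line, '(3i3)') 1, 22, 333
  print '(3a)', '[', line, ']'        ! prints [  1 22333]

  ! Three-digit values fill their fields, so they run together.
  write(line, '(3i3)') 100, 200, 300
  print '(3a)', '[', line, ']'        ! prints [100200300]

  ! Values needing more than 3 characters are replaced by asterisks.
  write(line, '(3i3)') 1000, -100, 5
  print '(3a)', '[', line, ']'        ! prints [******  5]
end program field_width
```

So a narrow format saves bytes only as long as every value is guaranteed to fit in it.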

Unformatted Direct Access stores the binary representation of the ints, so 32 bit or 4 bytes per number. That's 36 bytes in total. That's more than the 28 of your formatted version, but it can work with all integer numbers, up to 2,147,483,647 and down to -2,147,483,648 while still being the same size. (If you wanted the same flexibility in the formatted version, you'd need format 9I11 for 100 bytes total).

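The unformatted (sequential) version sits in between. Because it is unformatted, it stores the binary representation, but it also stores some metadata: a 4-byte record length before and after each record. That adds 8 bytes to the 36 bytes of data, which gives 44 bytes. So the 8-byte overhead from the quote is relative to form3.txt (36 bytes), not to the formatted form1.txt (28 bytes). The 16-byte difference you see is those 8 bytes of record markers plus another 8 bytes because nine 4-byte binary integers take more room than nine 3-character fields and a newline. Like unformatted direct access, it can store any integer in that range in the same amount of space.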
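
If you want to check the record overhead on your own compiler, a sketch along these lines may help. The file names seq.bin and dir.bin are made up for this example, and it uses inquire(iolength=...) instead of sizeof(i)*9 because the unit of recl differs between compilers.

```fortran
! Sketch: compare sequential and direct-access unformatted file sizes.
! File names are hypothetical; sizes are reported by the runtime.
program record_sizes
  implicit none
  integer :: vals(9), i, nrec, sz_seq, sz_dir

  vals = [(i, i = 1, 9)]

  ! Sequential unformatted: payload plus record markers added by the runtime.
  open(unit=98, file='seq.bin', form='unformatted', access='sequential', &
       status='replace')
  write(98) vals
  close(98)

  ! Direct access: ask the compiler for the record length in its own units.
  inquire(iolength=nrec) vals
  open(unit=97, file='dir.bin', form='unformatted', access='direct', &
       recl=nrec, status='replace')
  write(97, rec=1) vals
  close(97)

  ! size= returns the file size in file storage units (normally bytes).
  inquire(file='seq.bin', size=sz_seq)
  inquire(file='dir.bin', size=sz_dir)

  print '(a,i0)', 'sequential bytes: ', sz_seq
  print '(a,i0)', 'direct bytes:     ', sz_dir
  print '(a,i0)', 'record overhead:  ', sz_seq - sz_dir
end program record_sizes
```

With a setup like the one in the question, the two sizes should correspond to the 44 and 36 bytes shown by ls, but other compilers or options may use different record markers.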

As for your second question, what you should use depends on a lot of things. As you noticed, if your integers are always between 0 and 99, their string representation (3 bytes per I3 field) is smaller than their 4-byte binary representation. But once a field needs 4 or more characters (including the sign), the text form is no longer smaller. I should probably also point out that if your numbers are small, you might as well declare them as 8- or 16-bit integers, which would mean that they take up only one or two bytes each in binary form.
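
As a rough illustration of the smaller integer kinds, the following sketch uses the kind constants from the intrinsic module iso_fortran_env (Fortran 2008). Whether int8 and int16 are available depends on the compiler, though common ones provide them.

```fortran
! Sketch: bytes needed for nine integers of different kinds.
program small_kinds
  use, intrinsic :: iso_fortran_env, only: int8, int16, int32
  implicit none
  integer(int8)  :: a(9)
  integer(int16) :: b(9)
  integer(int32) :: c(9)
  integer :: i

  a = [(int(i, int8),  i = 1, 9)]
  b = [(int(i, int16), i = 1, 9)]
  c = [(int(i, int32), i = 1, 9)]

  ! storage_size returns bits per element; divide by 8 for bytes.
  print '(a,i0)', 'bytes for 9 x int8 : ', size(a) * storage_size(a) / 8
  print '(a,i0)', 'bytes for 9 x int16: ', size(b) * storage_size(b) / 8
  print '(a,i0)', 'bytes for 9 x int32: ', size(c) * storage_size(c) / 8
end program small_kinds
```

Written unformatted, the arrays occupy these payload sizes (9, 18 and 36 bytes), plus any record markers. The trade-off is the reduced range: an 8-bit integer only covers -128 to 127.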

Binary representation also is faster, as the numbers don't need to be converted between binary and string.

But for sizes that you are talking about, it might be valuable to investigate other file formats, like NetCDF which has some methods of compressing data.

Upvotes: 4

Obay

Reputation: 436

While not directly addressing your question, I would like to note that there is a lower limit to file size if you use binary data. Even if the most dense storage representation without any checksums or meta-information about e.g. record length is used, you will have to store sizeof(datatype)*num_entries bytes.

You could use a fast compression library like Blosc, which according to its developers can even outperform C's RAM-to-RAM memcpy(). Effectiveness and performance obviously depend strongly on the distribution of your data, but throughput can reach tens of GB/s in real-world applications.

100 GB is probably too much data to fit into your machine's RAM. It is possible to either chunk the files by hand or use a library like HDF5. HDF5 provides compressed, chunked storage for basically arbitrary amounts of data with high performance. However, incorporating a large library could be some work, even though there is an HDF5 Fortran API.
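
Chunking by hand could look roughly like the sketch below. It is only one possible approach, not something taken from the question: it assumes the five columns are double precision, uses stream access (Fortran 2003) so that no record markers are written, and the file name and chunk sizes are placeholders.

```fortran
! Sketch: write a five-column table in chunks with unformatted stream access.
! columns.bin, chunk_rows and nchunks are hypothetical values.
program chunked_stream
  use, intrinsic :: iso_fortran_env, only: real64, int64
  implicit none
  integer, parameter :: ncol = 5, chunk_rows = 1000, nchunks = 3
  real(real64) :: buf(ncol, chunk_rows)
  integer :: k, j, u
  integer(int64) :: fsize, lower_bound

  open(newunit=u, file='columns.bin', form='unformatted', access='stream', &
       status='replace')
  do k = 1, nchunks
    ! Placeholder data; a real program would fill buf from its computation.
    do j = 1, chunk_rows
      buf(:, j) = real((k - 1) * chunk_rows + j, real64)
    end do
    write(u) buf          ! only one chunk is held in memory at a time
  end do
  close(u)

  ! Lower bound from above: bytes per value times number of values.
  lower_bound = int(ncol, int64) * chunk_rows * nchunks * (storage_size(buf) / 8)
  inquire(file='columns.bin', size=fsize)

  print '(a,i0)', 'file size in bytes:   ', fsize
  print '(a,i0)', 'lower bound in bytes: ', lower_bound
end program chunked_stream
```

With stream access the file should come out at the lower bound mentioned above, so anything smaller requires narrower data types or actual compression.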

Upvotes: 3
