Reputation: 7647
I want to minimize the size of output files in FORTRAN without losing any data. To find the best method for doing so I wrote the program:
      program test
      character(len=255) format
    1 format(9i3)
c FORMATTED
      open(99,file='form1.txt',form='formatted')
      do i=1,1
        write(99,1) 1, 2, 3, 4, 5, 6, 7, 8, 9
      enddo
      close(99)
c UNFORMATTED
      open(98,file='form2.txt',form='unformatted')
      do i=1,1
        write(98) 1, 2, 3, 4, 5, 6, 7, 8, 9
      enddo
      close(98)
c DIRECT ACCESS
      nrec=sizeof(i)*9
      open(97,file='form3.txt',form='unformatted',
     & access='direct',recl=nrec)
      do i=1,1
        write(97,rec=i) 1, 2, 3, 4, 5, 6, 7, 8, 9
      enddo
      close(97)
      call system('ls -lh form?.txt')
      end
This will create three files with one record each. The output of this program is:
-rw-r--r--. 1 user users 28 May 27 17:10 form1.txt
-rw-r--r--. 1 user users 44 May 27 17:10 form2.txt
-rw-r--r--. 1 user users 36 May 27 17:10 form3.txt
From Oracle's website:
If FORM='UNFORMATTED', each record is preceded and terminated with an INTEGER*4 count, making each record 8 characters longer than normal. This convention is not shared with other languages, so it is useful only for communicating between FORTRAN programs.
My questions are:
form1.txt and form2.txt? Note that the size of form1.txt depends on the format (e.g. if I change the line format(9i3) to format(9i4), the file size of form1.txt increases by 9 bytes).
and my main question is:
A similar question to mine is: Best way to write a large array to file in fortran? Text vs Other
Upvotes: 2
Views: 1345
Reputation: 8140
Basically, your format 9i3 means that every number takes up exactly 3 bytes in the file. That's 27 bytes, plus one for the newline character, which makes 28.
But you can only store numbers up to 999 in this format, and even then, numbers over 99 will blend together.
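To make the width limit concrete, here is a minimal free-form sketch (the program name width_demo and the sample values are only for illustration). It writes to a character variable instead of a file so the fields can be printed between brackets:

```fortran
program width_demo
  implicit none
  character(len=6) :: line

  ! Two i3 fields with no separator: 99 and 100 run together.
  write(line, '(2i3)') 99, 100
  print '(a)', '[' // line // ']'

  ! A value that does not fit in 3 characters is replaced by asterisks.
  write(line, '(2i3)') 5, 1000
  print '(a)', '[' // line // ']'
end program width_demo
```

The first line should come out as [ 99100] and the second as [  5***], so a value that overflows the field is not truncated, it is lost.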
Unformatted Direct Access stores the binary representation of the ints, so 32 bits or 4 bytes per number. That's 36 bytes in total. That's more than the 28 of your formatted version, but it can work with all integer numbers, up to 2,147,483,647 and down to -2,147,483,648, while still being the same size. (If you wanted the same flexibility in the formatted version, you'd need format 9I11 for 100 bytes total.)
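One caveat about the direct access record length: sizeof is a compiler extension, and the unit of recl is compiler dependent (gfortran counts bytes, while Intel Fortran counts 4-byte words unless told otherwise). A more portable sketch, assuming default 4-byte integers, is to let inquire(iolength=...) compute the record length. The file name direct.bin and the program name are made up for this example:

```fortran
program recl_demo
  implicit none
  integer :: vals(9), u, nrec, nbytes, i

  vals = [(i, i = 1, 9)]

  ! Ask the compiler how long a record holding vals is, in its own recl units.
  inquire(iolength=nrec) vals

  open(newunit=u, file='direct.bin', form='unformatted', &
       access='direct', recl=nrec, status='replace')
  write(u, rec=1) vals
  close(u)

  ! Report the resulting file size (in file storage units, normally bytes).
  inquire(file='direct.bin', size=nbytes)
  print '(a,i0)', 'file size: ', nbytes
end program recl_demo
```

With 4-byte default integers this should report 36, the same size as form3.txt in the question.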
The unformatted (sequential) version is something of a hybrid. Being unformatted, it stores the binary representation, but it also stores some metadata (the record length) before and after each record. That's why it's a bit bigger still: 36 bytes of data plus 8 bytes of record markers gives 44. Like unformatted direct access, it can store any integer in the same amount of space.
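If you want to see those record markers, the following sketch writes one sequential unformatted record and then reopens the file with stream access to read the raw bytes back. It assumes the compiler uses 4-byte record markers (the gfortran default); the file name seq.bin and the variable names are only illustrative:

```fortran
program marker_demo
  use iso_fortran_env, only: int32
  implicit none
  integer(int32) :: vals(9), head, tail
  integer :: u, i, nbytes

  vals = [(i, i = 1, 9)]

  ! One sequential unformatted record: marker, data, marker.
  open(newunit=u, file='seq.bin', form='unformatted', &
       access='sequential', status='replace')
  write(u) vals
  close(u)

  inquire(file='seq.bin', size=nbytes)

  ! Reopen as a raw byte stream and read the markers around the data.
  ! ASSUMPTION: markers are 4-byte integers, as with gfortran's defaults.
  open(newunit=u, file='seq.bin', form='unformatted', &
       access='stream', status='old')
  read(u) head, vals, tail
  close(u)

  print '(a,i0)', 'file size: ', nbytes
  print '(a,i0)', 'leading marker: ', head
  print '(a,i0)', 'trailing marker: ', tail
end program marker_demo
```

Under that assumption the file size should be 44 and both markers should hold 36, the length of the data in bytes.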
As for your second question, what you should use depends on a lot of things. As you noticed, if your integers are always between 0 and 99, then their string representation is smaller than their binary representation. But once you need 4 digits (including the sign), then binary representation gets smaller. I should probably also point out that if your numbers are small, you might as well declare them as 8- or 16-bit integers, which would mean that they only take up one or two bytes respectively.
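As a rough illustration of the smaller integer kinds, the sketch below writes the same nine values once as 8-bit and once as 32-bit integers. Note that it uses stream access (Fortran 2003), which is not one of the three variants in the question: it writes the raw bytes with no record markers at all. The kind names come from iso_fortran_env; the file names are made up, and the values must fit in the chosen kind (-128 to 127 for int8):

```fortran
program small_ints
  use iso_fortran_env, only: int8, int32
  implicit none
  integer(int8)  :: small(9)
  integer(int32) :: wide(9)
  integer :: i, u, nbytes

  small = [(int(i, int8), i = 1, 9)]
  wide  = [(i, i = 1, 9)]

  ! 8-bit integers: one byte per value.
  open(newunit=u, file='small.bin', form='unformatted', &
       access='stream', status='replace')
  write(u) small
  close(u)
  inquire(file='small.bin', size=nbytes)
  print '(a,i0)', 'int8 file size: ', nbytes

  ! 32-bit integers: four bytes per value.
  open(newunit=u, file='wide.bin', form='unformatted', &
       access='stream', status='replace')
  write(u) wide
  close(u)
  inquire(file='wide.bin', size=nbytes)
  print '(a,i0)', 'int32 file size: ', nbytes
end program small_ints
```

On a typical compiler such as gfortran this should report 9 and 36 bytes. The price of stream access is that the file carries no structure of its own, so the reader has to know the kind and the number of values.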
Binary representation also is faster, as the numbers don't need to be converted between binary and string.
But for sizes that you are talking about, it might be valuable to investigate other file formats, like NetCDF which has some methods of compressing data.
Upvotes: 4
Reputation: 436
While not directly addressing your question, I would like to note that there is a lower limit to file size if you use binary data. Even if the most dense storage representation without any checksums or meta-information about e.g. record length is used, you will have to store sizeof(datatype)*num_entries bytes.
You could use a fast compression algorithm like blosc, which can even outperform C's RAM-to-RAM memcpy(). Effectiveness and performance obviously depend strongly on the distribution of your data, but can reach tens of GB/s in real-world applications.
100GB is probably too much data to fit into your machine's RAM. It is possible to either chunk the files by hand, or use a library like HDF5. HDF5 provides compressed, chunked storage for basically arbitrary amounts of data with high performance. However, incorporating a large library could be some work, even though there is an HDF5 Fortran API.
Upvotes: 3