Python reading unformatted direct access Fortran 90 gives incorrect output

Question

Here is how the data is written (its a 2-D matrix of floats. I am not sure the size).

open(unit=51,file='rmsd/'//nn_output,form='unformatted',access='direct',status='replace',&
     recl=Npoints*sizeofreal)

!a bunch of code omitted 

    write(51,rec=idx-nstart+1) (real(dist(jdx)),jdx=1,Npoints)

Here is how I am trying to read the file, inspired from the answer to these similar questions: 1, 2.

   f = open(inputfilename,'rb')
   field = np.fromfile(f,dtype='float64')

But the result is not correct. The diagonal of the matrix should be 0 (or very close) as it is a similarity matrix. I tried different dtypes but I still could not get a correct result.

EDIT: here is the output I get

array([  1.17610188e+01,   2.45736970e+02,   7.79741823e+02, ...,
         9.52930627e+05,   8.93743127e+05,   7.64186127e+05])

I haven't bothered with reshaping yet, as the values should range from 0 to ~20.

EDIT 2: Here is a link to the first 100 lines of the file: https://drive.google.com/file/d/0B2Mz7CoRS5g5SmRxTUg5X19saGs/view?usp=sharing

EDIT 3: Declarations at top of file

integer,parameter :: real_kind=8
!valuables for MPI
integer :: comm, nproc, myid, ierr

integer,allocatable :: idneigh(:)
real :: tmp
real,allocatable :: traj(:,:)
real(real_kind),allocatable :: dist(:)
real(real_kind),allocatable :: xx(:,:),yy(:,:),rot(:),weight(:)
character(200) :: nn_traj, nn_output, nn_neigh
integer,parameter :: sizeofreal=4,sizeofinteger=4

casey · Accepted Answer

Unformatted Fortran binary files encode record length to delimit the records. In general the record length will be written both before and after each record, though if memory serves the details of that are processor dependent (See the second half of the post if records are not delimited in this way). Looking at the file you posted, if you interpret the first 4 bytes as an integer and the remaining bytes as 32 bit floating point values you get:

0000000        881505604    7.302916e+00    8.723415e+00    6.914254e+00
0000020     9.826199e+00    7.044637e+00    8.601265e+00    6.629045e+00
0000040     6.103047e+00    9.476192e+00    9.326468e+00    6.535160e+00
0000060     8.904651e+00    4.710213e+00    6.534080e+00    1.156603e+01
0000100     1.046533e+01    9.343380e+00    8.574672e+00    7.498291e+00
0000120     1.071538e+01    7.138038e+00    5.898036e+00    6.182026e+00
0000140     7.037515e+00    6.418780e+00    6.294755e+00    8.327971e+00
0000160     6.796582e+00    7.397069e+00    6.493272e+00    1.126087e+01
0000200     6.467663e+00    7.178994e+00    7.867798e+00    5.921878e+00

If you seek past this record length field you can read the rest of the record into a python variable. You will want to specify the correct number of bytes because there will be another record length at the end of the record and that will read in as a wrong value into your array. 881505604 implies your NPoints is 220376401 (if this is not true, then see the second half of the post, this may be data and not a record length).

You can read this data in python with:

f = open('fortran_output', 'rb')
recl = np.fromfile(f, dtype='int32', count=1)
f.seek(4)
field = np.fromfile(f, dtype='float32')

print('Record length=',recl)
print(field)

This reads the record length, seeks 4 bytes forward and reads the rest of the file into a float32 array. You will want to, as mentioned before, specify a proper count= to the read to not intake the end record field.

This program outputs the following for your input file:

Record length= [881505604]
[ 7.30291557  8.72341537  6.91425419 ...,  6.4588294   6.53710747
  6.01582813]

All of the 32 bit reals in the array are between 0 and 20 as you suggest is proper for your data, so this looks like what you want.

If, however, your compiler does not encode record length in delimiting records for unformatted, direct access files, then the output is instead interpreted as 32 bit floats is:

0000000    2.583639e-07       7.3029156        8.723415        6.914254
0000020        9.826199        7.044637        8.601265        6.629045
0000040        6.103047       9.4761915        9.326468       6.5351596
0000060        8.904651        4.710213         6.53408       11.566033
0000100       10.465328         9.34338        8.574672        7.498291
0000120       10.715377        7.138038       5.8980355        6.182026
0000140       7.0375147         6.41878       6.2947555       8.3279705
0000160       6.7965817       7.3970685       6.4932723       11.260868
0000200        6.467663        7.178994        7.867798       5.9218783
0000220        6.710998         5.71757       6.1372333        5.809089

where the first value is much smaller toward zero, which may be appropriate given you say the matrix diagonal should be 0.

To read this you can do the read in python just as you were attempting, though it would be prudent to use count= to only read in the proper record length if there is more than one record in the file.

Using the python code:

f = open('fortran_output', 'rb')
field = np.fromfile(f, dtype='float32')

print(field)

produces the output

[  2.58363912e-07   7.30291557e+00   8.72341537e+00 ...,   6.45882940e+00
   6.53710747e+00   6.01582813e+00]

which matches the file output interpreted as 32 bit floats.

The particulars of the writing are generally with no record delimiter for this kind of output when tested with various gfortran and ifort versions, but may be like the first half of the post or something different for other compilers.

I also want to re-iterate the caveat that this solution is not with 100% confidence since we do not have the data that you wrote into this file to verify. You do have that information though, and you should verify the data produced is proper. It is also worth noting when compiling with default flags, ifort record lengths are in 4 byte words, while gfortran is in 1 byte units. This won't cause a problem in output but if not compensated for in ifort, your files will be 4 times bigger than needed. You can get record lengths in byte units by using -assume byterecl with ifort.

Python reading unformatted direct access Fortran 90 gives incorrect output

Answers (1)

Related Questions