bsmile

Reputation: 69

How to detect whether a file is formatted or unformatted?

My current approach is the following. I open the file in the default formatted form and try to read from it. If the read fails (an error or end of file), I treat the file as unformatted. But this does not give me confidence in the result. After all, why would an unformatted file necessarily fail a formatted read, and why would a formatted file necessarily fail an unformatted read? I would expect an unformatted file read as formatted to return an error most of the time, but not always, and a formatted file read as unformatted to give garbage rather than an error (a test code actually returns end of file). Is there a better way to check the file type?

Upvotes: 2

Views: 1434

Answers (2)

innoSPG

Reputation: 4656

Short answer

A formatted file contains mostly ASCII. Processors and implementations allow non-ASCII characters; writing them to a file is fine, but reading them back as formatted can be a problem. Assuming that your formatted files contain only ASCII characters and that your unformatted files are not limited to text, the following subroutine will do the job.

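!
subroutine detect_format(fName)
    character(*), intent(in) :: fName
    integer :: fId, stat, iRec
    character :: c
    logical :: formatted
    !
    formatted = .true. ! assume formatted until a non-ASCII byte is found
    ! direct access with recl=1 reads the file one record at a time;
    ! the unit of recl is processor dependent (it is one byte with gfortran)
    open(newunit=fId,file=fName,status='old',access='direct', &
         form='unformatted',recl=1,iostat=stat)
    if(stat/=0)then
        print*, 'cannot open ', trim(fName)
        return
    end if
    iRec = 0
    do while(formatted)
        iRec = iRec + 1
        read(fId, rec=iRec, iostat=stat)c
        ! I assume that it fails only on the end of file
        if(stat/=0) exit ! c is undefined after a failed read, so do not test it
        formatted = iachar(c)<=127
    end do
    if(formatted)then
        print*, trim(fName), ' is a formatted file'
    else
        print*, trim(fName), ' is an unformatted file'
    end if
    close(fId)
    !
end subroutine detect_format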

If your unformatted file contains only characters, this procedure will not help. In any case, there is no difference between a formatted and an unformatted character file, unless the unformatted file has variable record size. In that special case, you can detect it from the record size that is saved with each record.
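The saved record size can be checked along the following lines. This is only a sketch under a loud assumption: that the file was written as sequential unformatted by a compiler that frames each record with a 4-byte length before and after the data, in native byte order (the default layout of gfortran, for instance). The standard does not guarantee this layout, and the names `record_marker_check` and `has_record_markers` are made up for the illustration.

```fortran
module record_marker_check
    use, intrinsic :: iso_fortran_env, only: int32, int64
    implicit none
contains
    ! Returns .true. if the first record of the file is framed by two equal
    ! 4-byte length markers.
    ! ASSUMPTION: 4-byte markers in native byte order, as written by default
    ! by gfortran and some other compilers; not guaranteed by the standard.
    function has_record_markers(fName) result(ok)
        character(*), intent(in) :: fName
        logical :: ok
        integer :: fId, stat
        integer(int64) :: fsize
        integer(int32) :: head, tail

        ok = .false.
        open(newunit=fId, file=fName, status='old', access='stream', &
             form='unformatted', action='read', iostat=stat)
        if (stat /= 0) return
        inquire(unit=fId, size=fsize)

        ! leading marker: claimed length of the first record, in bytes
        read(fId, pos=1, iostat=stat) head
        if (stat == 0 .and. head >= 0) then
            ! the record and its two markers must fit inside the file
            if (int(head, int64) <= fsize - 8_int64) then
                ! trailing marker sits right after the record data
                read(fId, pos=int(head, int64) + 5_int64, iostat=stat) tail
                ok = (stat == 0) .and. (tail == head)
            end if
        end if
        close(fId)
    end function has_record_markers
end module record_marker_check
```

A call such as `has_record_markers('data.bin')` (file name hypothetical) returns `.true.` only when the first four bytes give a length that is repeated just after that many bytes of data. A text file will almost never pass, because its first four characters read as an integer usually exceed the file size.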

You can use heuristics to simplify the check. For example, you can treat the file as ASCII if the first 100 bytes are ASCII, or if more than 80% of the bytes are ASCII. The subroutine can also be made simpler by using stream-based I/O.
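Both heuristics can be sketched with stream access as follows. The module name `ascii_heuristic`, the function `ascii_fraction` and its `max_bytes` argument are hypothetical names chosen for this illustration, and the code assumes a compiler with Fortran 2003 stream I/O.

```fortran
module ascii_heuristic
    implicit none
contains
    ! Returns the fraction of ASCII bytes (values 0..127) among the first
    ! max_bytes bytes of the file, or -1.0 if the file cannot be opened.
    ! An empty file yields 1.0, since nothing contradicts "formatted".
    function ascii_fraction(fName, max_bytes) result(frac)
        character(*), intent(in) :: fName
        integer, intent(in) :: max_bytes
        real :: frac
        integer :: fId, stat, n_read, n_ascii, code
        character :: c

        frac = -1.0
        ! stream access exposes the file as a plain sequence of bytes
        open(newunit=fId, file=fName, status='old', access='stream', &
             form='unformatted', action='read', iostat=stat)
        if (stat /= 0) return

        n_read = 0
        n_ascii = 0
        do while (n_read < max_bytes)
            read(fId, iostat=stat) c
            if (stat /= 0) exit          ! end of file or read error
            n_read = n_read + 1
            code = iachar(c)
            if (code >= 0 .and. code <= 127) n_ascii = n_ascii + 1
        end do
        close(fId)

        if (n_read == 0) then
            frac = 1.0
        else
            frac = real(n_ascii) / real(n_read)
        end if
    end function ascii_fraction
end module ascii_heuristic
```

With this, `ascii_fraction(fName, 100) == 1.0` expresses "the first 100 bytes are ASCII", and `ascii_fraction(fName, huge(1)) > 0.8` expresses "more than 80% are ASCII". The thresholds remain a judgment call for your own files.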

Long answer

The first thing to understand is the internal representation of data in computer memory (RAM, disk, etc.), the external representation, and the difference between them. The second is the Fortran distinction between formatted and unformatted files.

  1. Internal and external representation of data in computer memory.

By internal representation, I mean the form in which the CPU processes the data, that is, the binary representation. In the internal representation, you must know the type of the data to give it a meaning. By external representation, I mean the glyphs that get printed on your screen or on paper by your printer. For example, if we are processing only numbers, the glyphs are the symbols (0, 1, 2, ..., 9) for the Latin-based languages and (I, II, III, IV, X, ...) for Roman numerals; other languages have their own glyphs. I am going a little beyond what the Fortran standard defines, but this is for the purpose of the transition. The Fortran standard uses only the symbols (0, 1, 2, ..., 9), but some implementations account for a decimal separator that can be either a comma or a dot. The human brain is able to figure out what the data is by looking at the external representation.

Between the internal representation and the external representation, there is an intermediate representation that helps humans and computers understand each other. That form is what makes the difference between formatted and unformatted files in Fortran. The intermediate form is the computer's internal representation of the external representation (the computer does not store glyphs; it only draws them on request when you want to see them). As a computer representation, the intermediate form is binary, but it has a one-to-one correspondence with the external representation (glyphs).

The storage unit in computer science is the byte. Some people like to go down to the level of the bit, but that is not necessary here. Data stored in computer memory are just strings of bytes. A byte itself is a string of 8 bits, meaning that a byte can store 256 possible values. Further, bytes are usually grouped by 4 or 8 (in the past such a group used to be called a word). Any byte or group of bytes makes sense only if you know the type of data it contains. You can process the same string of 4 bytes as a 4-byte integer, a 4-byte IEEE floating point number, a string of 4 one-byte characters, etc.

If you are processing 4-byte numbers (integers or IEEE floating point), the internal representation allows each byte to take all 256 possible values (a few bit patterns are reserved for markers such as NaN and Inf, but they are still values). If you are processing English text (ASCII), the internal representation allows each byte to take only the first 128 values (0 to 127). When it comes to the external representation, everything must be turned into glyphs: numbers and characters alike. The intermediate representation has to map numbers to glyphs, so each number must be turned into a string of digits. Because the digits are also ASCII characters, everything is limited to those 128 byte values. That is the key to determining the content of your file.
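To make the point about types concrete, here is a minimal sketch that reads the same 4 bytes as characters, as a 4-byte integer and as a 4-byte real, using the standard `transfer` intrinsic. The module and function names are hypothetical, and the sketch assumes the default character kind occupies one byte.

```fortran
module reinterpret_bytes
    use, intrinsic :: iso_fortran_env, only: int32, real32
    implicit none
contains
    ! Reads the 4 bytes of a character string as a 4-byte integer.
    ! No conversion takes place: the bit pattern is copied unchanged.
    pure function bytes_as_int(bytes) result(i)
        character(len=4), intent(in) :: bytes
        integer(int32) :: i
        i = transfer(bytes, i)
    end function bytes_as_int

    ! Reads the same 4 bytes as a 4-byte real.
    pure function bytes_as_real(bytes) result(r)
        character(len=4), intent(in) :: bytes
        real(real32) :: r
        r = transfer(bytes, r)
    end function bytes_as_real
end module reinterpret_bytes
```

For the bytes of `'ABCD'` (65, 66, 67, 68), `bytes_as_int('ABCD')` gives 1145258561 on a little-endian machine and 1094861636 on a big-endian one, while `bytes_as_real('ABCD')` gives an unrelated floating point value. The bytes are identical in all three readings; only the assumed type changes.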

  2. Fortran formatted and unformatted files

Fortran mostly uses formatted files for human-readable content. The content of such a file is the intermediate representation, limited to ASCII for the English language. Unformatted files hold the binary or internal representation of data as they are processed in the CPU. It is like a dump of the RAM.
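The difference shows up directly in file sizes. The sketch below writes one integer both ways and lets you compare the results; `format_demo`, `write_both` and `file_size` are hypothetical names. The unformatted file uses stream access here so that it contains only the raw bytes of the value, without any record bookkeeping a compiler may add to sequential unformatted files.

```fortran
module format_demo
    use, intrinsic :: iso_fortran_env, only: int32, int64
    implicit none
contains
    ! Writes the same integer to a formatted file (as digit characters)
    ! and to an unformatted stream file (as its raw 4 bytes).
    subroutine write_both(val, txtName, binName)
        integer(int32), intent(in) :: val
        character(*), intent(in) :: txtName, binName
        integer :: u

        open(newunit=u, file=txtName, form='formatted', status='replace')
        write(u, '(i0)') val         ! digits followed by a line terminator
        close(u)

        open(newunit=u, file=binName, access='stream', &
             form='unformatted', status='replace')
        write(u) val                 ! the 4 bytes of the internal representation
        close(u)
    end subroutine write_both

    ! Size of a file in bytes (-1 if the processor cannot determine it).
    function file_size(fName) result(n)
        character(*), intent(in) :: fName
        integer(int64) :: n
        inquire(file=fName, size=n)
    end function file_size
end module format_demo
```

After `call write_both(123456789_int32, 'demo.txt', 'demo.bin')`, the unformatted file holds exactly 4 bytes, whereas the formatted file holds the nine digit characters plus a line terminator, all of them ASCII.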

To detect the content with a modern Fortran compiler, you just have to open the file, read it byte by byte, and check whether it contains only ASCII. If you find a non-ASCII byte, you have an unformatted file; otherwise you have a formatted file. Reading byte by byte can be done with stream-based I/O in modern compilers, or with fixed-size records of 1 byte each. The latter is the one used in the example.

I have to add that life is not that simple. This procedure gives only a high probability, not the exact truth. Having all bytes in the ASCII range does not guarantee that the content is characters. And if you have a character file, it does not matter whether it is formatted or fixed-size-record unformatted: it will contain ASCII.

Upvotes: 4

Holmz

Reputation: 724

One approach is to name the files in a logical way. Personally, I use .dat, .txt or .csv for formatted data, and .bin for binary data. Unless you have hundreds of files, perhaps you can just open them in an editor and see what they look like?

Upvotes: 1
