Reputation: 79
What is the deciding factor for classifying a file as a binary file or a text file?
For example, consider the C program below.
NOTE: Before running the program, make sure binary.txt doesn't exist.
Observation:
The program creates a file "binary.txt" whose contents read TEXTFILE (on a little-endian machine with 4-byte int).
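#include <stdio.h>
#include <stdlib.h>   /* for exit() */
int main(void)
{
    /* On a little-endian machine these two ints are stored as the bytes "TEXT" and "FILE". */
    int arr[2] = {1415071060,1162627398};
    FILE *fp = fopen("binary.txt", "wb");
    if(fp == NULL)
    {
        printf("Error opening file\n");
        exit(1);
    }
    fwrite(arr, sizeof(arr), 1, fp);
    fclose(fp);
    return 0;
}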
However, only the creator knows that it was created in binary mode and should therefore be called a binary file.
Anyone else who opens "binary.txt" will think it's a text file.
What should a general user call this file: a binary file or a text file?
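To make the observation concrete, here is a minimal sketch of why those two integers show up as TEXTFILE. It assumes the little-endian byte order that fwrite() would reproduce on such a machine; store_le32 is a hypothetical helper name, not part of the original program.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: store a 32-bit value least-significant byte first,
   which is the order in which a little-endian machine keeps it in memory. */
void store_le32(uint32_t value, unsigned char out[4])
{
    assert(out != NULL);
    for (size_t i = 0; i < 4; i++) {
        out[i] = (unsigned char)((value >> (8 * i)) & 0xFFu);
    }
}
```

Calling store_le32(1415071060u, buf) fills buf with the bytes 'T', 'E', 'X', 'T', and store_le32(1162627398u, buf) fills it with 'F', 'I', 'L', 'E'. On a big-endian machine the same program would write the bytes of each integer in the opposite order.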
Upvotes: 5
Views: 5793
Reputation: 12204
I think you are asking two different questions.
File contents
If the file contains textual data, i.e., lines of characters delimited by newlines, then it is a text file.
Otherwise it is presumed to contain data in some form other than strictly character data, such as binary integers, floating-point numbers, image pixels, music samples, structured binary data, etc., which means that it is a binary file, i.e., a non-text file.
There are many other text file formats, such as .xml, .html, .csv, as well as programming language source code files. These are strictly character text files, but generally have some kind of internal structure based on the syntax of their contents.
That being said, all text files are inherently binary files, in the sense that the characters, newlines, and so forth comprising the textual data in the file are nothing more than a stream of bytes at the lowest level.
File name
Specifically, the filename extension or suffix. By convention, files with a .txt extension are presumed to contain text data, i.e., lines of character data delimited by some kind of newline sequence.
A different filename extension like .bin or .exe (or a hundred others) indicates some kind of binary data file, usually structured in some way. By convention, .bin indicates binary data with no specific format, i.e., just a stream of bytes.
In addition, there are files having an extension like .doc or .pdf (or dozens of others), indicating a formatted document file. These files also contain character text data, but it is typically stored in some kind of binary format that is specific to the software used to create it.
Upvotes: 1
Reputation: 20772
This question has changed substantially since it was first posed. In particular, the term "executable" has been removed from the discussion.
Current question:
Only creator knows that it is created in binary mode and this should be called binary file.
The creator has not only created the file but also made it available. If the purpose and format were not communicated, then that is a failure somewhere.
Anyone who opens the file "binary.txt" think it's text file.
People would probably think so, but they still can't properly process it as a text file without knowing the character encoding. Again, a communications failure. A guessed-at character encoding that works today might not work for the contents of the file tomorrow.
Answer to original question:
Yes, it's all a matter of interpretation. Interpretation requires context and metadata.
In addition to what others have said,
A file cannot be text unless you know which character encoding was used to write it (and must be used to read it). Common file systems do not store this knowledge. People dealing in text files must pass this essential metadata on to programs and other people.
A file cannot be executable unless you know which interpreter program or program loader to load it with. Systems have schemes for this:
On Unix-like systems, a #! line stating the program to run it in.
On Windows, the PATHEXT environment variable, e.g. PATHEXT=.COM;.EXE;.BAT;.CMD;.VBS;.VBE;.JS;.JSE;.WSF;.WSH;.MSC
On Windows, the extension would be registered with an Open verb indicating how to "open" or start it.
A file can be called binary whether or not you have metadata to call it text or executable or both.
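As a small illustration of the #! scheme, the sketch below checks whether a buffer begins with the two bytes that mark an interpreter line. The function name has_shebang is hypothetical, and a real program loader does considerably more than this (it also checks permissions and parses the interpreter path).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true if the buffer starts with the two bytes "#!". */
bool has_shebang(const unsigned char *data, size_t len)
{
    assert(data != NULL || len == 0);
    return len >= 2 && data[0] == '#' && data[1] == '!';
}
```

A script beginning with "#!/bin/sh" passes this check, while a file beginning with any other two bytes does not. Note that the check says nothing about whether the rest of the file is text.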
Upvotes: 3
Reputation: 764
NOTE: This discussion is limited to ASCII; multi-byte character sets and other encodings are set aside to avoid unnecessary confusion.
Let us first understand the difference between a string and an array of characters.
A byte of 8 bits can store 0 to 255 if unsigned, or -128 to +127 if signed.
Taken as a whole, then, the value held in a byte (8 bits) falls somewhere in the range -128 to 255, depending on signedness. The range of ASCII characters is 0 to 127.
Given a character array a[10], if any of the bytes a[0] to a[9] has a value outside the ASCII range, then it is not a string, it's just an array of characters. If all of the bytes fall within the ASCII range (0 to 127), then it is a string.
In summary, for an array of characters the values can be anywhere in -128 to 255.
The important conclusion here is that, since the ASCII range (0 to 127) is a proper subset of -128 to 255, all strings can be called arrays of characters.
Now let us apply the above definition to binary file vs text file.
If all bytes in a file are in the ASCII range (0 to 127), it should be called a text file.
If any of them falls outside this range, i.e., in -128 to -1 or 128 to 255, then it is a binary file.
In summary, since the ASCII range 0 to 127 is a proper subset of -128 to 255, all text files are binary files.
If a file has at least one byte in -128 to -1 or 128 to 255, it cannot be a text file, only a binary file.
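The rule above can be sketched as a single pass over the bytes. This is only an illustration of the rule as stated in this answer; is_all_ascii is a hypothetical name, and reading each byte as unsigned char is a choice made here so that the negative half of the range never comes into play.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true if every byte is in the ASCII range 0..127. */
bool is_all_ascii(const unsigned char *data, size_t len)
{
    assert(data != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        if (data[i] > 127) {
            return false;   /* byte is in 128..255 */
        }
    }
    return true;
}
```

Under this rule a file holding the bytes of "TEXTFILE" counts as text, and a file containing the byte 0x8A does not. The rule also accepts control characters such as 0x03, which is one reason it may be too permissive in practice.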
I have not checked whether any standard gives special treatment to particular characters in the ASCII range. But in summary, I think I have made the distinction between a text file and a binary file clear.
Hope this helps
Upvotes: -5
Reputation: 144520
On modern operating systems, there is no distinction at the file system level between text files and binary files. On legacy systems, the C library implements a series of tricks to translate newlines between OS-specific representations (such as 0x0D 0x0A) and the single-byte representation '\n' for the C program reading the file in text mode. This compatibility layer must not be used when dealing with actual binary contents, for which the b option must be used in fopen().
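To show why that translation is harmful for binary contents, here is a rough sketch of the reading direction of such a layer, collapsing each 0x0D 0x0A pair into a single '\n'. The name crlf_to_lf is hypothetical, and actual C libraries differ in the details of what their text mode does.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: rewrite buf in place so that every "\r\n" pair
   becomes "\n". Returns the new length, which is never larger than len. */
size_t crlf_to_lf(char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    size_t out = 0;
    for (size_t i = 0; i < len; i++) {
        /* Drop a '\r' only when it is immediately followed by '\n'. */
        if (buf[i] == '\r' && i + 1 < len && buf[i + 1] == '\n') {
            continue;
        }
        buf[out++] = buf[i];
    }
    return out;
}
```

Applied to text, this turns "a\r\nb\r\n" into "a\nb\n". Applied to binary data that happens to contain the bytes 0x0D 0x0A, it silently removes a byte, which is the kind of corruption the b option avoids.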
Older operating systems used to have different representations for text and binary files, but most of these are obsolete nowadays.
Conversely, many file systems keep track of executable files with some specific information such as mode bits on Unix FS. These executable files can be binary, containing one form or another of executable code, while others are text files containing scripts.
In your example, whether the file should be seen as binary or text is a matter of intent. If the creator of the file intended it to be read as binary, naming it binary.txt is confusing, as the filename extension .txt is routinely used to indicate generic text files. sample.bin would be much more obvious.
How to interpret the contents of a file is important for programmers and casual users: on legacy systems, loading and saving a file as text may change its contents, unless you use tools that are meticulous about preserving contents.
For example qemacs, a programmer's editor inspired by emacs, makes extensive efforts upon loading a file to determine the best mode for displaying and editing the contents:
If the file is written back without modifications, the contents are preserved so binary files that happen to have textual contents are unmodified. Otherwise, the above tests determine the correct conventions for encoding new contents.
Upvotes: 4
Reputation: 206557
@JohnBollinger summarized it best in a comment.
text vs. binary is not a fundamental file characteristic on modern operating systems, but rather a differentiation between how files are interpreted.
Let's say a file contains four bytes with the following hex values of the bytes:
0x41 0x42 0x43 0x44
If you interpret those bytes as characters in a system that uses ASCII encoding, you will get the characters ABCD.
If you treat those bytes as a 4-byte integer, you will get the value 0x41424344 (1094861636 in decimal) in a big endian system and 0x44434241 (1145258561 in decimal) in a little endian system.
As far as the computer is concerned, it's all binary. As to what they mean, it's all a matter of interpretation.
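The two integer readings can be reproduced without depending on the host machine's byte order. The sketch below assembles the value explicitly in each order; read_be32 and read_le32 are hypothetical helper names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: interpret 4 bytes as a big-endian integer. */
uint32_t read_be32(const unsigned char b[4])
{
    assert(b != NULL);
    return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
           ((uint32_t)b[2] << 8)  |  (uint32_t)b[3];
}

/* Hypothetical helper: interpret the same 4 bytes as little-endian. */
uint32_t read_le32(const unsigned char b[4])
{
    assert(b != NULL);
    return ((uint32_t)b[3] << 24) | ((uint32_t)b[2] << 16) |
           ((uint32_t)b[1] << 8)  |  (uint32_t)b[0];
}
```

For the bytes 0x41 0x42 0x43 0x44, read_be32 yields 0x41424344 and read_le32 yields 0x44434241, while printing the same bytes as ASCII characters gives ABCD.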
Upvotes: 11
Reputation: 47915
In general, a file is just a sequence of bytes.
For any machine you're likely to use, bytes are 8 bits. So each byte has 256 possible values.
Confining our attention for the moment to old-fashioned ASCII, something like 95 of those byte values are ordinary, printing characters: letters, digits, punctuation. There are a few more characters which may also appear in text files: let's say tab, carriage return, linefeed, and form feed ('\t', '\r', '\n', and '\f').
If every one of the bytes in a file is one of those printing characters (or one of the few whitespace characters just listed), the file is a text file.
If any of the bytes in a file is other than one of those characters, the file is not a text file.
If the file is intended for human consumption, its creator will have used only the ordinary printing characters, and it will be a text file.
If the file contains arbitrary data, each byte might have any of its 256 possible values, and the file will be a binary file. It's very likely that at least one of the bytes in such a file will be something other than an ordinary printing character. (Even if all the arbitrary bytes just happen to be in the set of ordinary printable characters, they probably won't mean much, and we might still think of it as a binary file.)
Anyway, that's why every text file is theoretically a binary file, but not every binary file is a text file.
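That test is easy to write down. The following is a minimal sketch of it, assuming plain ASCII as in the discussion above; the names is_text_byte and looks_like_text are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: one of the 95 printing ASCII characters
   (0x20..0x7E), or tab, carriage return, linefeed, or form feed. */
bool is_text_byte(unsigned char c)
{
    return (c >= 0x20 && c <= 0x7E) ||
           c == '\t' || c == '\r' || c == '\n' || c == '\f';
}

/* Hypothetical helper: true if every byte in the buffer passes. */
bool looks_like_text(const unsigned char *data, size_t len)
{
    assert(data != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        if (!is_text_byte(data[i])) {
            return false;
        }
    }
    return true;
}
```

A buffer holding the bytes 0x03 0x8A fails the test, while one holding 0x30 0x39 passes it even if those bytes were written as a raw integer. The test can therefore only say that a file could be text, not that it was meant to be.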
As a practical example, try this program:
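#include <stdio.h>
#include <stdlib.h>   /* for exit() */
int main(void)
{
    short int x = 906;
    FILE *fp1 = fopen("textfile.txt", "w");    /* text mode */
    FILE *fp2 = fopen("binaryfile.bin", "wb"); /* binary mode */
    if(fp1 == NULL || fp2 == NULL) exit(1);
    fprintf(fp1, "%d\n", x);          /* writes the characters 9, 0, 6 */
    fwrite(&x, sizeof(x), 1, fp2);    /* writes the raw bytes of x */
    fclose(fp1);
    fclose(fp2);
    return 0;
}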
If you compile and run this program, you should find that it creates a text file textfile.txt containing the string 906. But if you inspect the file binaryfile.bin, you should find that it contains just two bytes, with the hexadecimal values 03 and 8A (in an order that depends on your machine's byte order). Neither of those is an ordinary printing character, so it's a binary file.
Now, try changing the program slightly, setting
short int x = 12345;
If you run it again, textfile.txt will now contain the string 12345, as expected. binaryfile.bin will again contain two bytes, this time with hex values 30 and 39. But if you try printing binaryfile.bin, you'll probably see the characters 0 and 9, because 0x30 and 0x39 are the ASCII codes for the characters 0 and 9.
Upvotes: -1
Reputation: 35154
I think one has to distinguish "text", "binary", and "executable":
"Text" usually means a file containing only human readable characters (alpha + numeric + tabs and cr/lf), i.e. something that you can open with a text editor without seeing weird stuff.
The meaning of "binary" often depends on the context. If the context is, for example, the open mode used in file processing, then "binary" means that each byte is read in as is, whereas "text" means that platform-specific conversions like automatically converting a "\r\n" into a single "\n" apply (cf., for example, FILE *fp=fopen("c:\\test.txt", "rb") versus FILE *fp=fopen("c:\\test.txt", "rt"), where the t is a non-standard extension accepted by some Windows C libraries). If the context is the distribution format of programs, then "binary" often means "precompiled for a particular platform". This is in contrast to source code distributions, where the files are typically "text files".
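The effect of the open mode can be observed directly. The sketch below writes the two characters "a\n" in text mode and then counts the bytes actually stored by reading the file back in binary mode. The function name size_after_text_write and the idea of passing a scratch path are assumptions made for this illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: write "a\n" to path in text mode, then count the
   bytes stored on disk by reading in binary mode. Returns -1 on failure.
   The scratch file is removed before returning. */
long size_after_text_write(const char *path)
{
    assert(path != NULL);

    FILE *out = fopen(path, "w");            /* text mode */
    if (out == NULL) {
        return -1;
    }
    int write_failed = (fputs("a\n", out) == EOF);
    if (fclose(out) == EOF) {
        write_failed = 1;
    }
    if (write_failed) {
        remove(path);
        return -1;
    }

    FILE *in = fopen(path, "rb");            /* binary mode: bytes as is */
    if (in == NULL) {
        remove(path);
        return -1;
    }
    long count = 0;
    while (fgetc(in) != EOF) {
        count++;
    }
    fclose(in);
    remove(path);
    return count;
}
```

On a system whose text mode translates newlines, such as Windows, the count should be 3 because "\n" is stored as "\r\n". On Unix-like systems the two modes behave identically and the count is 2.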
The meaning of "executable" is that the file content is interpreted by the operating system as an executable program. This often means a file containing machine code instructions, which include non-readable characters as well, so such files are usually not "text files" and are usually not interpreted as text. In a broader sense, shell scripts are also "executables", as they contain instructions interpreted by the respective shell. These instructions are written as text and can be opened in a text editor.
From these perspectives, I think that "text" and "binary" are opposite terms, whereas "executable" is orthogonal to both.
Upvotes: 2