N56 dH

Reputation: 161

a clear understanding of file, file encoding, file format

I lack a clear understanding of the concepts of file, file encoding and file format. Google helped up to a point. From what I understand so far, all the files are binary, i.e., each byte in such a file can contain any of the 256 possible strings of bits. ASCII files (and here's where we get to the encoding part) are a subset of binary files, where each byte uses only 7 bits.

And here's where things get mixed up. A file format seems to be a way to interpret the bytes in a file, and file extensions seem to be one of the most used ways of identifying a file format.

Does this mean there are formats defined for binary files and formats defined for ASCII files? Are formats like xml, pdf, doc, rtf, html, xls, sql, tex, java, cs "referring" to ASCII files? Whereas formats like jpg, mp3, avi, eps, obj, out, dll are a clue that we're talking about binary files?

Upvotes: 16

Views: 16279

Answers (4)

marshal craft

Reputation: 447

I think it is worth noting that, with media files, MPEG and others are media codecs: they define how digital data represents video and audio. The encoded data is generally housed in a media container file such as an AVI file, which is really a RIFF file type intended for media.
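To make the container idea concrete, here is a minimal sketch of reading the first 12 bytes of a RIFF file: the literal `RIFF`, a little-endian 32-bit size, and a four-character form type (`AVI ` for AVI files). The function name `parse_riff_header` and the hand-built header are made up for illustration; a real AVI file has many more chunks after this.

```python
import struct


def parse_riff_header(data: bytes) -> tuple[int, str]:
    """Return (declared_size, form_type) from the first 12 bytes of a RIFF file."""
    if len(data) < 12:
        raise ValueError("need at least 12 bytes for a RIFF header")
    # '<' means little-endian with no padding: 4-byte id, uint32 size, 4-byte form type
    chunk_id, size, form_type = struct.unpack("<4sI4s", data[:12])
    if chunk_id != b"RIFF":
        raise ValueError("not a RIFF file")
    return size, form_type.decode("ascii")


# A hand-built header for illustration only (not a playable file).
# The size field counts the bytes after it, here just the 4-byte form type.
fake_header = b"RIFF" + struct.pack("<I", 4) + b"AVI "
size, form_type = parse_riff_header(fake_header)
```

The codec (how the video and audio are compressed) is a separate matter from this outer container layout.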

Upvotes: 0

panther

Reputation: 767

This is an old question but still very relevant. I was confused by this as well, and asked around for clarification. Here's the summary (hope it helps someone):

Format: A file/record format is the way data is represented. You might use CSV, TSV, JSON, Apache log format, Thrift, Protobuf, etc. to represent your data. The format is responsible for ensuring the data is structured properly and correctly represented. For example, when you read a JSON file like this one, you can expect nested key-value pairs; that guarantee is always present.

{
    "story": {
        "title": "beauty and the beast"
    }
}
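As a rough sketch of "same data, different format", the snippet below writes the title from the example once as JSON and once as CSV, then reads both back using Python's standard library. The variable names are only for illustration.

```python
import csv
import io
import json

record = {"title": "beauty and the beast"}

# JSON format: nested key-value pairs
as_json = json.dumps({"story": record})

# CSV format: a header row followed by one data row
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["title"], lineterminator="\n")
writer.writeheader()
writer.writerow(record)
as_csv = buf.getvalue()

# Each format has its own parser, but both recover the same record
from_json = json.loads(as_json)["story"]
from_csv = dict(next(csv.DictReader(io.StringIO(as_csv))))
```

Both results hold the same information; only the layout of the characters in the file differs.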

Encoding: Encoding transforms your data (in any format, or plain text) into a specific scheme. Now, what is this scheme? The scheme depends on the purpose of the encoding. For example, when transferring data over the wire (the internet), we want to make sure the JSON example above reaches the other side correctly and is not corrupted. To ensure this, we would add some meta info, such as a checksum, that can be used to verify the data's correctness. Other uses of encoding include shortening data, exchanging secrets, etc.

Base64 encoding of above JSON example:

ew0KICAgICAgICAic3RvcnkiOiB7DQogICAgICAgICAgICAidGl0bGUiOiAiYmVhdXR5IGFuZCB0aGUgYmVhc3QiDQogICAgICAgIH0NCn0=
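A small sketch of how that string can be decoded back with Python's `base64` module. Note that Base64 on its own only re-expresses bytes as ASCII characters; it is reversible without any key and does not add a checksum. The variable names are mine.

```python
import base64
import json

# The Base64 string from the answer, copied as-is
answer_b64 = (
    "ew0KICAgICAgICAic3RvcnkiOiB7DQogICAgICAgICAgICAidGl0bGUiOiAiYmVhdXR5"
    "IGFuZCB0aGUgYmVhc3QiDQogICAgICAgIH0NCn0="
)

raw = base64.b64decode(answer_b64)   # back to the original bytes
text = raw.decode("ascii")           # the bytes are plain ASCII JSON text
data = json.loads(text)              # parse the JSON format

# Encoding the same bytes again gives back the same Base64 string
round_trip = base64.b64encode(raw).decode("ascii")
```

The decoded text uses Windows-style `\r\n` line endings, which is why the encoded string would differ from one produced on a system that uses `\n`, even though the JSON content is the same.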

Upvotes: 1

Pablo Santa Cruz

Reputation: 181300

I don't think the useful distinction is between ASCII and BINARY files, but between TEXT and BINARY files.

In that sense, these are text files: XML, HTML, RTF, SQL, TEXT, JAVA, CSS, EPS.

And these are binary files: PDF, DOC, XLS, JPG, MP3, AVI, OBJ, DLL.

ASCII is just a table of characters used in the early days of computing to represent text, but it is nowadays somewhat discouraged since it can't represent text in languages such as Chinese, Arabic, Spanish (words with ñ, Ñ, or accents), French and others. Nowadays other CHARACTER ENCODINGS are encouraged instead of ASCII. The most well known is probably UTF-8, but there are others such as ISO-8859-1 and ISO-8859-3. Take a look at this article by Joel Spolsky talking about UNICODE. It's very enlightening.
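A quick sketch of the difference, using the ñ mentioned above. The same character becomes different bytes under different encodings, and plain ASCII cannot represent it at all:

```python
text = "ñ"

utf8_bytes = text.encode("utf-8")         # two bytes: 0xC3 0xB1
latin1_bytes = text.encode("iso-8859-1")  # one byte: 0xF1

# ASCII only covers code points 0..127, so encoding ñ fails
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Decoding bytes with the wrong encoding does not fail here, it just gives wrong text
misread = utf8_bytes.decode("iso-8859-1")
```

This is why a text file is only readable if the program opening it assumes the same encoding that was used to write it.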

File formats are a separate issue. File formats are conventions that programs agree on to represent information. In that sense, a JPG file is an image with a certain (well-known) internal format that allows programs (browsers, spreadsheets, word processors) to use it as an image.
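Many binary formats start with a fixed signature ("magic bytes"), so programs can recognize them without relying on the file extension. A minimal sketch, assuming only three well-known signatures; the `sniff_format` helper and the `SIGNATURES` table are made up for this example and are nowhere near complete:

```python
# Leading bytes of a few well-known formats
SIGNATURES = {
    b"\xff\xd8\xff": "JPEG",
    b"\x89PNG\r\n\x1a\n": "PNG",
    b"%PDF-": "PDF",
}


def sniff_format(data: bytes) -> str:
    """Guess a format name from the first bytes of a file's content."""
    for magic, name in SIGNATURES.items():
        if data.startswith(magic):
            return name
    return "unknown"


jpeg_guess = sniff_format(b"\xff\xd8\xff\xe0" + b"\x00" * 16)
pdf_guess = sniff_format(b"%PDF-1.7\n")
text_guess = sniff_format(b"hello world\n")
```

In practice you would pass in the first few bytes read from a file opened in binary mode (`open(path, "rb")`).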

Text files also have formats (e.g., there are specifications for text files like XML and HTML). A text format, like JPG and other binary formats, lets applications use the files in a coherent and specific way to achieve something, e.g., rendering a WEB PAGE (HTML and XHTML file formats).

Upvotes: 13

Dani

Reputation: 15069

The actual way the file is stored on the hard drive is defined by the OS. The actual content of the file can be described as an array of bytes, each of which can hold one of 256 possible values.

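Text files will use either a single-byte character set (7-bit ASCII with 128 characters, or an 8-bit extension of it with 256), in which case you can read them easily, or a wider character set, in which case only apps that understand that encoding can read them.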

The rest, what you might call binary (and any other format that is "unreadable" by "text" viewers), are formats designed to be read by certain other apps or by the OS. If a file is executable, the OS can read and execute it; others, like JPG, are designed to be "understood" by photo viewers, etc.
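To tie this back to the question's "each byte uses only 7 bits": a rough check for whether some bytes could be plain ASCII is that every byte value is below 128. The helper name `is_seven_bit` is just for illustration, and passing the check does not prove the content is meaningful text.

```python
def is_seven_bit(data: bytes) -> bool:
    """True if every byte fits in 7 bits, i.e. is a valid ASCII code."""
    return all(b < 0x80 for b in data)


ascii_bytes = "hello".encode("ascii")
utf8_bytes = "señor".encode("utf-8")           # ñ becomes two bytes >= 0x80
jpeg_start = bytes([0xFF, 0xD8, 0xFF, 0xE0])   # typical start of a JPEG file

ascii_result = is_seven_bit(ascii_bytes)
utf8_result = is_seven_bit(utf8_bytes)
jpeg_result = is_seven_bit(jpeg_start)
```

So an ASCII file is still a sequence of bytes like any other file; it just happens to use a restricted set of byte values.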

Upvotes: 2
