Read encoded files

I was trying to read some files, such as images, but when I open them in Notepad I see strange characters like this:

ÿH‹\$0H‹t$8HƒÄ _ÃÌÌÌÌÌÌH‰\$H‰l$H‰t$ WAVAWHƒì ·L

So I have the following questions:

  1. Why do I find those weird symbols instead of zeros and ones?
  2. Do programmers do this for security or optimization?
  3. Is this an encoding, such as ASCII, in which every symbol has a unique decimal and binary number associated with it?
  4. Can anyone with the corresponding decoder read this information?

Thank you

Upvotes: 0

Views: 62

Answers (1)

Lunivore

Reputation: 17602

Most data files, images included, are stored as raw binary bytes; hexadecimal is just a convenient way of displaying those bytes. If you know the format of the file, you can use a hexadecimal editor (I use HexEdit) to look at the data.

A colour is often stored as RGB, meaning Red, Green, and Blue, with one byte for each, so for instance, this is a dark red:

80 00 00 // (there are no spaces in the real file format, but hex editors add them.)
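A minimal Python sketch of what a hex editor does with those three bytes; the helper name `hex_view` is made up for illustration and is not part of any tool mentioned here:

```python
def hex_view(data: bytes) -> str:
    """Render raw bytes the way a hex editor would: two hex digits per byte."""
    return " ".join(f"{b:02x}" for b in data)


# The dark red from above as three raw bytes: R, G, B.
dark_red = bytes([0x80, 0x00, 0x00])

print(hex_view(dark_red))  # 80 00 00
print(list(dark_red))      # [128, 0, 0]
```

The file only ever contains the byte values; the hex digits and the spaces between them exist only in the display.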

The format of an image depends on how it's stored. Most image formats have ways of encoding the difference between pixels rather than the actual pixels themselves, because there's a lot of information redundancy between the different pixels.

For instance, if I have a picture of the night sky with a focus on the moon, there's probably a big area in one corner that's all much the same shade of grey; encoding that without optimization would mean a hell of a lot of file that just read:

9080b09080b09080b09080b09080b09080b09080b59080b59080b5...

In this case, the grey is slightly bluish-purple, tending towards a brighter blue at the end. I've stored it as RGB here - R:90, G:80, B:b0 - but there are other formats for that storage too. Try them out here.

Instead of listing every pixel, I could equally say instead "6 lots of bluish-gray then it gets brighter in blue":

=6x9080b0+3x000005+...

This reduces the amount of information I would need to transmit. Most optimizations aren't quite that human-readable, but they operate on similar lines (this is a general information principle used in all kinds of things like .zip files too, not just images).
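The "6 lots of the same pixel" idea is usually called run-length encoding. Here is a rough Python sketch of it, assuming the pixels are held as hex strings; the function names `rle_encode` and `rle_decode` are just illustrative. Unlike my notation above, it stores each run's actual value rather than the difference from the previous run:

```python
from itertools import groupby


def rle_encode(pixels):
    """Collapse consecutive identical pixels into (count, value) pairs."""
    return [(len(list(group)), value) for value, group in groupby(pixels)]


def rle_decode(runs):
    """Expand (count, value) pairs back into the original pixel list."""
    return [value for count, value in runs for _ in range(count)]


# The night-sky row from above: six identical pixels, then three brighter ones.
pixels = ["9080b0"] * 6 + ["9080b5"] * 3

runs = rle_encode(pixels)
print(runs)  # [(6, '9080b0'), (3, '9080b5')]

# Decoding gives back exactly what we started with.
assert rle_decode(runs) == pixels
```

Nine pixels become two pairs, and decoding reproduces the original list exactly.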

Note that this is still a lossless format; I could always get back to the actual pixel-perfect image. Bitmaps (.bmp) are lossless (though obviously still digital; they will never capture everything a human sees).

A number of formats encode the image in terms of its spatial frequencies rather than its individual pixels. It's a bit like looking at the waveform of music, except it's two-dimensional. Depending on how much of that frequency information is kept, detail can easily be lost here (and often is). JPEGs (.jpg) use lossy compression like this.

The reason you see recognisable characters is that Notepad treats every byte as a character code, and some of the values just happen to coincide with ASCII text codes. It's pure coincidence; Notepad is doing its best to interpret what's essentially gibberish. For instance this colour sequence:

4e4f424f4459

happens to coincide with the letters "NOBODY", but also represents two pixels next to each other. Both are grey, especially the left (R:4e, G:4f, B:42) with the right-most one being a bit more blue (R:4f, G:44, B:59).
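You can check this yourself with a few lines of Python. This is only a sketch, and it assumes the raw three-bytes-per-pixel RGB layout used in this answer:

```python
# The same six bytes, read two different ways.
data = bytes.fromhex("4e4f424f4459")

# Read as text: each byte is looked up as an ASCII character code.
print(data.decode("ascii"))  # NOBODY

# Read as pixels: each group of three bytes is one (R, G, B) triple.
pixels = [tuple(data[i:i + 3]) for i in range(0, len(data), 3)]
print(pixels)  # [(78, 79, 66), (79, 68, 89)]
```

Nothing in the bytes themselves says which reading is right; that depends entirely on the program opening the file.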

But that's only if your format is storing raw pixel information... which is expensive, so it probably isn't the case.

Image formats are a pretty specialist area. The famous XKCD cartoon "Digital Data" showcases the optimizations being made in some of them. This is why, generally speaking, you shouldn't use JPEG for text, but use something like PNG (.png) instead.

Upvotes: 1
