Nate-Wilkins

Reputation: 5502

Does encoding matter when writing to a file?

I was told today that when writing to a file, the encoding you write in doesn't matter. I don't know a lot about encodings, but this sounds reasonable, considering an encoding is only needed for reading/viewing?

Does the encoding in which bytes are read from a file matter? Is the encoding there only for parsing/display?

ex.

var bytes = getFileBytes();
bytes.remove(new byte[] { 232, 211 });
anotherStream.writeBytes(bytes);
// I'm assuming that Encoding is irrelevant 

Upvotes: 0

Views: 713

Answers (2)

Vladimir Matveev

Reputation: 127881

Encoding does not matter when you are simply reading bytes from a file and are not trying to interpret those bytes as text. For example, you can safely ignore encoding if you want to copy a file to another file or to a socket. You also don't need an encoding if the stream contains binary data, e.g. a sequence of ints in binary form. Your example is also perfectly valid, as long as you are not treating the bytes 232 and 211 as characters.
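As a rough illustration of the byte-copy case, here is a minimal sketch in Java. The class name `RawCopy` and the sample bytes are made up for this example; the point is only that no charset appears anywhere, because the bytes are never interpreted as text.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;

// Hypothetical helper, not from the question: copies bytes verbatim.
class RawCopy {
    // Copies every byte from in to out without interpreting any of them.
    static void copy(InputStream in, OutputStream out) {
        try {
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Arbitrary sample data, including the two bytes from the question.
        byte[] original = { (byte) 232, (byte) 211, 0x41, 0x00 };
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        copy(new ByteArrayInputStream(original), out);
        // The copy is byte-for-byte identical, so this prints true.
        System.out.println(Arrays.equals(original, out.toByteArray()));
    }
}
```

The same loop works unchanged with a `FileInputStream` and a socket's output stream.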

However, when you start interpreting a file (or any sequence of bytes, e.g. a byte array) as text, you can't ignore encoding, because bytes can be converted to characters only by means of some encoding. It is usually possible not to specify an encoding when using something like FileReader, but in that case the encoding is still chosen implicitly, usually defaulting to your locale's encoding. Because of this, it is better to always specify the encoding you intend to use when loading character data from byte streams (e.g. via InputStreamReader), so that the actual encoding does not depend on the system your program runs on.
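To make that concrete, here is a small sketch (the class name `ExplicitCharset` and the sample character are assumptions for illustration) that decodes the same two bytes through `InputStreamReader` with two different explicitly named charsets. The bytes never change; only the interpretation does.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only.
class ExplicitCharset {
    // Decodes bytes to text using the charset the caller names explicitly.
    static String decode(byte[] bytes, Charset charset) {
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes), charset)) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute accent) is the two bytes 0xC3 0xA9 in UTF-8.
        byte[] bytes = "\u00e9".getBytes(StandardCharsets.UTF_8);
        // Decoded as UTF-8, the two bytes form one character.
        System.out.println(decode(bytes, StandardCharsets.UTF_8).length());
        // Decoded as ISO-8859-1, each byte becomes its own character.
        System.out.println(decode(bytes, StandardCharsets.ISO_8859_1).length());
    }
}
```

Passing the charset explicitly means the result is the same on every machine, whatever its default locale encoding happens to be.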

Upvotes: 1

tripleee

Reputation: 189517

What I think somebody might have told you is that if you have to choose between encodings, it doesn't matter which one you pick as long as you stick to it.

This ignores issues like the efficiency of the encoding, of course (if one of them stores your typical data in fewer bytes, use that one).

Consider the opposite scenario - you could write in one encoding and then either (a) forget about ever reading the data back in or (b) read the data incorrectly.

To use a contrived example, let's say you cannot use the letter lowercase i in your data file for some reason. So to store that, you need to encode it somehow. You decide to store it as \48. But now, how do you represent the literal sequence \48 unambiguously, should you ever need to? Ah ha, your encoding can accommodate that, too: store any literal backslash as \5C. But of course, when you read the file back in, you have to decode this encoding, or you will end up with the wrong bytes. (ThÁ&s Á&s more common than you may thÁ&nk!)
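A minimal sketch of that contrived scheme in Java, assuming exactly the two escape codes named above (`\48` for a lowercase i, `\5C` for a literal backslash). The class name `EscapeCodec` and its method names are made up for this illustration.

```java
// Hypothetical codec for the contrived scheme: no lowercase i may appear in the stored data.
class EscapeCodec {
    // Replaces each 'i' with \48 and each backslash with \5C.
    static String encode(String text) {
        StringBuilder sb = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (c == 'i') {
                sb.append("\\48");
            } else if (c == '\\') {
                sb.append("\\5C");
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Reverses encode(); rejects any escape other than the two defined codes.
    static String decode(String encoded) {
        StringBuilder sb = new StringBuilder();
        int pos = 0;
        while (pos < encoded.length()) {
            char c = encoded.charAt(pos);
            if (c != '\\') {
                sb.append(c);
                pos++;
                continue;
            }
            if (pos + 3 > encoded.length()) {
                throw new IllegalArgumentException("truncated escape at " + pos);
            }
            String code = encoded.substring(pos + 1, pos + 3);
            if (code.equals("48")) {
                sb.append('i');
            } else if (code.equals("5C")) {
                sb.append('\\');
            } else {
                throw new IllegalArgumentException("unknown escape \\" + code);
            }
            pos += 3;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String original = "this \\48 thing"; // contains a literal \48
        String stored = encode(original);
        System.out.println(stored);
        // Round-tripping through decode recovers the original, so this prints true.
        System.out.println(decode(stored).equals(original));
    }
}
```

Reading `stored` back without calling `decode` leaves the escape sequences in place, which is the "wrong bytes" failure the answer describes.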

Upvotes: 3
