NSString to NSData encoding considerations

Question

I understand why when going from NSData to NSString you need to specify encoding. However I'm finding it frustrating how the reverse (NSString to NSData) needs to have an encoding specified.

In this related question the answers suggested using NSUTF8StringEncoding or defaultCStringEncoding, with the latter not being fully explained.

So I just wanted to ask IF the following is correct when converting NSString to NSData:

In cases where you want to be 100% sure the binary representation of the NSString object is UTF8 then use NSUTF8StringEncoding (or whatever encoding is needed)
In cases where the encoding of the NSString object is known/expected to already be of a certain type and no conversion is required then it's safe (perhaps internally faster) to use defaultCStringEncoding (from what I have read objective-c uses UTF-16 internally, not sure if LE or BE but I'd assume LE because the platform is LE)

TIA

chapka · Accepted Answer

The encoding needs to be specified for converting NSString to NSData for the same reason it needs to be specified going from NSData to NSString.

An NSData object is a wrapper for a string of absolutely raw bytes. If the NSString doesn't specify some encoding, it doesn't know what to write, because at the level of ones and zeroes, a UTF-16 encoding looks different from a UTF-8 encoding of the same letter, and of course, if you write UTF-16 as big-endian and read it as little-endian you will get gibberish.

In other words, don't think of it as converting or escaping a string; it's generating a byte buffer, and the encoding tells it which ones and zeroes to write when the next character is "a" and which ones to write when it means "妈".

As for your question...here's my two cents.

1) If you are converting an NSString to an NSData so that your same program can convert it back later, and no other software will need to deal with that NSData until after you've read it back into an NSString, then none of this matters. All that matters is that your string-to-data encoding and your data-to-string encoding match.

2) If you are dealing only with ASCII characters, you can probably get away with a lot, just because many kinds of encoding use the same representation for characters under 128. But this breaks easily, even with little things like smart quotes.

3) Despite the name, defaultCStringEncoding is not something you should use as a default. It's designed for special circumstances where you need to deal with system strings and don't otherwise know how the system deals with its internal strings. It refers to the way strings are handled in the default C implementation, NOT in the NSString internals, so there's not necessarily a performance benefit.

4) If you write a string with an unknown string encoding, and you try to read it back with a different string encoding, your code will fail; in many cases, you will just end up with an empty string.

Bottom line is: who will be trying to interpret your NSData objects? If it's your own app, pick an encoding that makes sense for you (I use UTF8 for everything) and use it for both conversions. Otherwise, figure out what your ecosystem needs to read or write and make that your standard.

NSString to NSData encoding considerations

Answers (1)

Related Questions