john

Reputation: 31

Unicode vs ASCII memory

What is the difference between Unicode and ASCII in terms of memory? How much memory does text take in each? I know that for Unicode it depends on the encoding (UTF-8, UTF-16, etc.), but I need a deeper understanding.

Upvotes: 2

Views: 3913

Answers (1)

James Fry

Reputation: 1153

In short, ASCII uses 7-bit code points (i.e. 7 bits uniquely identify every character), whereas Unicode is defined using 21-bit code points. The Unicode range runs from 0 to 10FFFF hex, organised as 17 planes of 65,536 (2^16) code points each, which gives 1,114,112 code points; the smallest power of two that covers this is 2^21. How much memory that uses depends on the way the text is encoded in memory, which is not necessarily the same as the serialisation encoding used to externalise that data in files (typically one of the UTF encodings for Unicode).
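The arithmetic above is easy to check. A minimal Python sketch follows; the constant and variable names are mine, chosen only for illustration:

```python
# Sanity-check the code point arithmetic (names here are illustrative).
ASCII_MAX = 0x7F          # highest ASCII code point
UNICODE_MAX = 0x10FFFF    # highest Unicode code point

PLANES = 17
PLANE_SIZE = 1 << 16      # 65,536 code points per plane

total_code_points = PLANES * PLANE_SIZE
ascii_bits = ASCII_MAX.bit_length()      # bits needed to hold any ASCII code point
unicode_bits = UNICODE_MAX.bit_length()  # bits needed to hold any Unicode code point

print(total_code_points)  # 1114112
print(ascii_bits)         # 7
print(unicode_bits)       # 21
```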

In practice ASCII is stored as one character per byte in RAM, leaving the eighth bit unused. It is very rare to see pure ASCII, particularly outside of the USA. It is more common to see ISO 8859-1, an 8-bit encoding that is completely compatible with ASCII but uses the extra bit for additional characters, e.g. the £ and ¡ characters needed in some European countries.
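As a rough illustration of the one-byte-per-character point, here is a small Python sketch using the standard "ascii" and "latin-1" (ISO 8859-1) codecs. The sample strings are made up for the example:

```python
# Sample strings are hypothetical; "latin-1" is Python's name for ISO 8859-1.
text = "Cost: £5"

latin1_bytes = text.encode("latin-1")
print(len(text), len(latin1_bytes))  # 8 8  (one byte per character)

# Pure 7-bit ASCII has no code for the pound sign, so encoding fails.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)  # False

# Text that only uses ASCII characters yields identical bytes in both encodings.
plain = "Cost: 5"
same_bytes = plain.encode("ascii") == plain.encode("latin-1")
print(same_bytes)  # True
```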

Unicode is more complex, and representations vary considerably:

  • Java uses 16-bit characters with the idea of a 'surrogate pair' to represent values that are outside the 'basic multilingual plane' (code points above FFFF hex, which need two 16-bit units). This is historic; early versions of Unicode used only 16 bits per character.
  • In C it may often make sense to use the variable-length UTF-8 encoding as the in-memory representation, since char is a byte after all. Such an encoding comes with a small performance hit when decoding: finding the n-th code point is harder, because one effectively has to iterate through the encoded byte array identifying the start of each character.
  • It may also make sense to use UTF-32 (in practice equivalent to the older UCS-4), as this stores every Unicode code point in a 32-bit integer, in a similar way to ASCII being stored in an 8-bit integer.
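To make the size differences concrete, the sketch below measures a few sample characters in UTF-8, UTF-16 and UTF-32, and shows the linear scan needed to locate the n-th code point in UTF-8. It is in Python rather than Java or C purely for brevity. The helper names `encoded_sizes` and `nth_code_point_offset` are hypothetical, and the "-le" codec variants are used so that no byte order mark is counted:

```python
def encoded_sizes(text):
    """Return the number of bytes text occupies in each encoding."""
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return {enc: len(text.encode(enc)) for enc in encodings}


def nth_code_point_offset(data, n):
    """Byte offset at which the n-th (0-based) code point starts.

    Assumes data is valid UTF-8. Continuation bytes look like 10xxxxxx,
    so every other byte marks the start of a code point.
    """
    seen = -1
    for offset, byte in enumerate(data):
        if (byte & 0xC0) != 0x80:
            seen += 1
            if seen == n:
                return offset
    raise IndexError("code point index out of range")


# U+0041, U+00A3, U+20AC, and U+1F600 (outside the basic multilingual plane)
samples = ["A", "£", "€", "\U0001F600"]
for s in samples:
    print(repr(s), encoded_sizes(s))

# The last sample needs two 16-bit units in UTF-16, i.e. a surrogate pair.
utf16_units = len("\U0001F600".encode("utf-16-le")) // 2
print(utf16_units)  # 2

# Finding the 4th code point (index 3) means walking the bytes before it.
data = "".join(samples).encode("utf-8")
print(nth_code_point_offset(data, 3))  # 6
```

UTF-32 trades memory for constant-time indexing: every sample above takes 4 bytes, while UTF-8 takes between 1 and 4.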

Joel's article is golden reading for this topic.

Upvotes: 2
