Quest Monger

Reputation: 8652

ASCII vs Unicode + UTF-8

I was reading Joel Spolsky's 'The Absolute Minimum' article about character encoding. My understanding is that ASCII is both a code-point scheme and an encoding scheme, whereas in modern times we use Unicode as the code-point scheme and UTF-8 as the encoding scheme. Is this correct?

Upvotes: 74

Views: 123059

Answers (3)

mation4in

Reputation: 15

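Unicode and ASCII both assign code points to characters. ASCII is commonly treated as code points plus an encoding scheme in one, whereas Unicode code points need a separate encoding scheme such as UTF-8, UTF-16, or UTF-32.

Unicode (encoded as UTF-8) is a superset of ASCII, as it's backward compatible with ASCII.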


Conversion and representation (in binary/hexadecimal) of a string:

  • String := sequence of graphemes (user-perceived characters; a single character is the simplest kind of grapheme).
  • The sequence of graphemes (characters) is mapped to code points by the character set (Unicode or ASCII).
  • The code points are encoded (converted) to binary/hex by an encoding scheme: UTF-8, UTF-16, or UTF-32 for Unicode, and the usual one-byte-per-character scheme for ASCII.
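
The steps above can be sketched in Python roughly as follows. This is only an illustration: the sample string and the variable names are my own assumptions, and UTF-8 is picked as the encoding scheme for the last step.

```python
# Hypothetical sample: "n" followed by precomposed "é" (U+00E9).
text = "n\u00e9"

# Step 1: characters -> code points (the integers Unicode assigns).
code_points = [ord(ch) for ch in text]

# Step 2: code points -> bytes, via an encoding scheme (UTF-8 here).
utf8_bytes = text.encode("utf-8")

print([f"U+{cp:04X}" for cp in code_points])  # ['U+006E', 'U+00E9']
print(utf8_bytes.hex(" "))                    # 6e c3 a9

# One grapheme can span several code points:
# "e" followed by COMBINING ACUTE ACCENT (U+0301) displays as a single "é".
decomposed = "e\u0301"
print(len(decomposed))                        # 2
```

Note that Python's `len` counts code points, not graphemes, which is why the decomposed "é" reports a length of 2.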

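Unicode (in UTF-8 or any other UTF) supports 1,112,064 valid character code points: the 1,114,112 code points from U+0000 to U+10FFFF minus the 2,048 surrogates. This covers most of the graphemes from different languages.

ASCII supports 128 character code points (mostly English).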
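
For anyone wondering where those two numbers come from, here is the arithmetic as a small Python sketch. The constant names and the helper `utf8_can_encode` are made up for this illustration; the ranges themselves are the Unicode code space (U+0000 to U+10FFFF) and the surrogate block (U+D800 to U+DFFF).

```python
TOTAL_CODE_POINTS = 0x10FFFF + 1      # U+0000 .. U+10FFFF
SURROGATES = 0xDFFF - 0xD800 + 1      # U+D800 .. U+DFFF, reserved for UTF-16
ENCODABLE = TOTAL_CODE_POINTS - SURROGATES
ASCII_CODE_POINTS = 2 ** 7            # 7-bit codes

print(TOTAL_CODE_POINTS)   # 1114112
print(SURROGATES)          # 2048
print(ENCODABLE)           # 1112064
print(ASCII_CODE_POINTS)   # 128


def utf8_can_encode(cp: int) -> bool:
    """Hypothetical helper: True if code point cp has a UTF-8 encoding."""
    try:
        chr(cp).encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


print(utf8_can_encode(0x41))     # True  (LATIN CAPITAL LETTER A)
print(utf8_can_encode(0xD800))   # False (a surrogate)
```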

Upvotes: 1

Jukka K. Korpela

Reputation: 201508

Yes, except that UTF-8 is an encoding scheme, not the only one. Other encoding schemes include UTF-16 (with two different byte orders) and UTF-32. (To add to the confusion, a UTF-16 scheme is called “Unicode” in Microsoft software.)
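
To make the difference between these schemes concrete, here is a minimal Python sketch that encodes one code point with each of them. The choice of the euro sign (U+20AC) and the variable names are my own assumptions; the codec names are the ones Python's standard library provides.

```python
ch = "\u20ac"  # EURO SIGN, a single code point (U+20AC)

# The same code point under different encoding schemes and byte orders.
encodings = ["utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be"]
encoded = {name: ch.encode(name).hex(" ") for name in encodings}

for name, hex_bytes in encoded.items():
    print(f"{name:10} {hex_bytes}")
# utf-8      e2 82 ac
# utf-16-le  ac 20
# utf-16-be  20 ac
# utf-32-le  ac 20 00 00
# utf-32-be  00 00 20 ac
```

The code point stays the same in every row; only the byte sequence chosen by the encoding scheme changes.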

And, to be exact, the American National Standard that defines ASCII specifies a collection of characters and their coding as 7-bit quantities, without specifying a particular transfer encoding in terms of bytes. In the past, it was used in different ways, e.g. five ASCII characters were packed into one 36-bit storage unit, or 8-bit bytes used the extra bit for checking purposes (parity bit) or for transfer control. Nowadays, one ASCII character is encoded as one 8-bit byte with the first (most significant) bit set to zero. This is the de facto standard encoding scheme and is implied in a large number of specifications, but strictly speaking it is not part of the ASCII standard.
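
As a rough illustration of the two byte-level conventions mentioned above, the Python sketch below shows the modern form (high bit zero) next to an even-parity variant. The function `with_even_parity` is a hypothetical helper written for this example and does not model any particular historical system.

```python
def with_even_parity(ch: str) -> int:
    """Hypothetical helper: put an even-parity bit in bit 7 of a 7-bit ASCII code."""
    code = ord(ch)
    if code > 0x7F:
        raise ValueError("not a 7-bit ASCII character")
    parity = bin(code).count("1") % 2   # 1 if the number of set bits is odd
    return code | (parity << 7)


# Modern convention: one character per byte, most significant bit zero.
modern = "C".encode("ascii")[0]
print(f"{modern:08b}")                   # 01000011

# Parity convention: the eighth bit makes the count of 1 bits even.
print(f"{with_even_parity('C'):08b}")    # 11000011
print(f"{with_even_parity('A'):08b}")    # 01000001
```

"C" (0x43) has three set bits, so the parity bit is set; "A" (0x41) has two, so it stays clear.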

Upvotes: 51

Remy Lebeau

Reputation: 595319

Nowadays, ASCII is a subset of UTF-8 rather than its own scheme. UTF-8 is backwards compatible with ASCII: any ASCII text is also valid UTF-8, byte for byte.
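
A quick way to check the backward compatibility yourself is the Python sketch below. The variable names are my own, and it relies only on the standard `ascii` and `utf-8` codecs.

```python
# All 128 ASCII characters in one string.
ascii_text = "".join(chr(i) for i in range(128))

# Encoding as ASCII and as UTF-8 gives identical bytes.
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")
print(same_bytes)                        # True

# Bytes written as ASCII can be read back as UTF-8 unchanged.
print(b"hello".decode("utf-8"))          # hello

# Outside ASCII, every UTF-8 byte has the high bit set,
# so it can never be mistaken for an ASCII character.
non_ascii = "\u00e9".encode("utf-8")     # "é" -> c3 a9
high_bits_set = all(b >= 0x80 for b in non_ascii)
print(high_bits_set)                     # True
```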

Upvotes: 77
