Reputation: 20296
I am a bit confused about encodings. As far as I know old ASCII characters took one byte per character. How many bytes does a Unicode character require?
I assume that one Unicode character can contain every possible character from any language - am I correct? So how many bytes does it need per character?
And what do UTF-7, UTF-6, UTF-16 etc. mean? Are they different versions of Unicode?
I read the Wikipedia article about Unicode but it is quite difficult for me. I am looking forward to seeing a simple answer.
Upvotes: 299
Views: 392507
Reputation: 3811
I know this question is old and already has an accepted answer, but I want to offer a few examples (hoping it'll be useful to someone).
As far as I know old ASCII characters took one byte per character.
Right. Actually, since ASCII is a 7-bit encoding, it supports 128 codes (95 of which are printable), so strictly speaking it uses less than a full byte; in practice each character is still stored in one byte, with the top bit unused.
How many bytes does a Unicode character require?
Unicode just maps characters to codepoints. It doesn't define how to encode them. A text file does not contain Unicode characters, but bytes/octets that may represent Unicode characters.
I assume that one Unicode character can contain every possible character from any language - am I correct?
No. But almost. So basically yes. But still no.
So how many bytes does it need per character?
Same as your 2nd question.
And what do UTF-7, UTF-6, UTF-16 etc. mean? Are they some kind of Unicode versions?
No, those are encodings. They define how bytes/octets should represent Unicode characters.
A couple of examples. If some of those cannot be displayed in your browser (probably because the font doesn't support them), go to http://codepoints.net/U+1F6AA (replace 1F6AA with the codepoint in hex) to see an image.
Char  Codepoint  UTF-8 bytes  UTF-16 bytes
a     U+0061     1            2
©     U+00A9     2            2
®     U+00AE     2            2
ጷ     U+1337     3            2
—     U+2014     3            2
‰     U+2030     3            2
€     U+20AC     3            2
™     U+2122     3            2
☃     U+2603     3            2
☎     U+260E     3            2
☔     U+2614     3            2
☺     U+263A     3            2
⚑     U+2691     3            2
⚛     U+269B     3            2
✈     U+2708     3            2
✞     U+271E     3            2
〠     U+3020     3            2
肉    U+8089     3            2
💩    U+1F4A9    4            4
🚀    U+1F680    4            4
Okay I'm getting carried away...
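If you want to verify numbers like these yourself, here is a minimal Python 3 sketch (just an illustration): it encodes each sample character and counts the resulting bytes ('utf-16-le' is used so the count isn't inflated by a BOM).

    # Print the code point and the byte count of each sample character
    # in UTF-8 and UTF-16.
    for ch in "a©®ጷ—€™☃💩🚀":
        print(f"U+{ord(ch):04X}  UTF-8: {len(ch.encode('utf-8'))} bytes  "
              f"UTF-16: {len(ch.encode('utf-16-le'))} bytes")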
Upvotes: 58
Reputation: 7049
In Unicode, every character is represented by an integer from zero to 0x10FFFF. Doing this naively in 32-bit integers is called the UTF-32 encoding. To be less wasteful, UTF-8 and UTF-16 are encodings that require less space for the lower codepoints.
Note that what is called UTF-16 in implementations is often really just UCS-2: the subset of codepoints that fit in a single 16-bit code unit (the Basic Multilingual Plane), with no surrogate-pair handling.
The storage requirements are as follows.
In UTF-8:
1 byte: 0 - 7F (ASCII)
2 bytes: 80 - 7FF (all European plus some Middle Eastern)
3 bytes: 800 - FFFF (the rest of the Basic Multilingual Plane, including the private-use area)
4 bytes: 10000 - 10FFFF
In UTF-16:
2 bytes: 0 - D7FF and E000 - FFFF (the whole Basic Multilingual Plane except the surrogate range D800 - DFFF)
4 bytes: 10000 - 10FFFF (the supplementary planes, encoded as surrogate pairs)
In UTF-32:
4 bytes: 0 - 10FFFF
10FFFF is the last Unicode codepoint by definition, and it's defined that way because it's UTF-16's technical limit.
It is also the largest codepoint UTF-8 can encode in 4 bytes, but the idea behind UTF-8's encoding also works for 5- and 6-byte sequences, which would cover codepoints up to 7FFFFFFF, i.e. half of what UTF-32 can.
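Those ranges translate directly into code. Here is a small Python sketch (the helper names are just for illustration) that computes the storage requirement of a single codepoint in each encoding; it ignores the invalid surrogate range D800 - DFFF:

    def utf8_bytes(cp):
        # Bytes needed for one codepoint in UTF-8, per the ranges above.
        if cp <= 0x7F:    return 1
        if cp <= 0x7FF:   return 2
        if cp <= 0xFFFF:  return 3
        return 4          # 10000 - 10FFFF

    def utf16_bytes(cp):
        # 2 bytes inside the BMP, 4 bytes (a surrogate pair) above it.
        return 2 if cp <= 0xFFFF else 4

    def utf32_bytes(cp):
        return 4          # always

    print(utf8_bytes(0x20AC), utf16_bytes(0x20AC))    # 3 2  (the euro sign)
    print(utf8_bytes(0x1F680), utf16_bytes(0x1F680))  # 4 4  (a rocket emoji)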
Upvotes: 24
Reputation: 21259
In Unicode the answer is not easily given. The problem, as you already pointed out, is the encodings.
Given any English sentence without diacritic characters, the answer for UTF-8 would be as many bytes as there are characters, and for UTF-16 it would be twice the number of characters.
The only encoding where (as of now) we can make a general statement about the size is UTF-32. There it's always 32 bits per code point, even though I imagine that code points are prepared for a future UTF-64 :)
What makes it so difficult are at least two things: combining characters (a logical character can consist of more than one codepoint), and the fact that the same codepoint can be given more than one byte representation. For example, U+20AC can be represented either as the three-byte sequence E2 82 AC or as the four-byte sequence F0 82 82 AC (the latter is an overlong form that the current UTF-8 rules disallow).
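As a quick illustration of the overlong-form point, a Python sketch: the UTF-8 codec accepts only the canonical three-byte form of U+20AC and rejects the four-byte one.

    print("\u20ac".encode("utf-8"))          # b'\xe2\x82\xac' -- the canonical 3-byte form
    print(b"\xe2\x82\xac".decode("utf-8"))   # '€'
    b"\xf0\x82\x82\xac".decode("utf-8")      # raises UnicodeDecodeError: overlong forms are rejected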
Upvotes: 8
Reputation: 11110
Unicode is a standard which provides a unique number, called a code point, for every character existing in the world (some are still to be added).
For different purposes you might need to represent these code points in bytes (most programming languages do so), and that's where character encoding kicks in.
UTF-8, UTF-16, UTF-32 and so on are all character encodings, and Unicode's code points are represented in these encodings in different ways.
UTF-8 is a variable-width encoding: a character encoded in it occupies 1 to 4 bytes.
UTF-16 is also variable-width: a character encoded in it takes either 2 or 4 bytes (one or two 16-bit code units). A single code unit covers only the part of Unicode called the BMP (Basic Multilingual Plane), which is enough for almost all cases; characters outside the BMP need a pair of code units. Java uses UTF-16 for its strings and characters.
UTF-32 is fixed-width: each character takes exactly 4 bytes (32 bits).
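To make the code-unit arithmetic concrete, a small Python sketch (just an illustration); the same distinction is why Java's String.length(), which counts UTF-16 code units, reports 2 for a single character outside the BMP:

    s = "\U0001F680"                          # U+1F680, a character outside the BMP
    print(len(s))                             # 1 code point
    print(len(s.encode("utf-16-le")) // 2)    # 2 UTF-16 code units (a surrogate pair)
    print(len(s.encode("utf-8")))             # 4 bytes in UTF-8
    print(len(s.encode("utf-32-le")))         # 4 bytes in UTF-32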
Upvotes: 4
Reputation: 781
From Wiki:
UTF-8, an 8-bit variable-width encoding which maximizes compatibility with ASCII;
UTF-16, a 16-bit, variable-width encoding;
UTF-32, a 32-bit, fixed-width encoding.
These are the three most popular encodings.
Upvotes: 1
Reputation: 4103
Strangely enough, nobody has pointed out how to calculate how many bytes one Unicode character takes. Here is the rule for UTF-8 encoded strings:
Binary Hex Comments
0xxxxxxx 0x00..0x7F Only byte of a 1-byte character encoding
10xxxxxx 0x80..0xBF Continuation byte: one of 1-3 bytes following the first
110xxxxx 0xC0..0xDF First byte of a 2-byte character encoding
1110xxxx 0xE0..0xEF First byte of a 3-byte character encoding
11110xxx 0xF0..0xF7 First byte of a 4-byte character encoding
So the quick answer is: it takes 1 to 4 bytes, depending on the first byte, which indicates how many bytes the sequence takes up.
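In code, that rule boils down to inspecting the leading bits of the first byte. A minimal Python sketch (the function name is just for illustration, and well-formed UTF-8 input is assumed):

    def utf8_seq_len(first_byte):
        # Length of a UTF-8 sequence, determined from its first byte.
        if first_byte < 0x80: return 1        # 0xxxxxxx
        if first_byte < 0xC0: raise ValueError("continuation byte, not a first byte")
        if first_byte < 0xE0: return 2        # 110xxxxx
        if first_byte < 0xF0: return 3        # 1110xxxx
        return 4                              # 11110xxx

    data = "\u20ac".encode("utf-8")           # b'\xe2\x82\xac'
    print(utf8_seq_len(data[0]))              # 3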
Upvotes: 269
Reputation: 126309
Check out this Unicode code converter. For example, enter 0x2009, where 2009 is the Unicode number for thin space, in the "0x... notation" field, and click Convert. The hexadecimal number E2 80 89 (3 bytes) appears in the "UTF-8 code units" field.
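The same check can be done without the web tool, for example with a quick Python sketch:

    ch = "\u2009"                 # THIN SPACE
    print(ch.encode("utf-8"))     # b'\xe2\x80\x89' -- i.e. E2 80 89, 3 bytes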
Upvotes: 1
Reputation: 40336
You won't see a simple answer because there isn't one.
First, Unicode doesn't contain "every character from every language", although it sure does try.
Unicode itself is a mapping: it defines codepoints, and a codepoint is a number, usually associated with a character. I say usually because there are concepts like combining characters. You may be familiar with things like accents or umlauts. Those can be used with another character, such as an a or a u, to create a new logical character. A character therefore can consist of 1 or more codepoints.
To be useful in computing systems we need to choose a representation for this information. Those are the various Unicode encodings, such as UTF-8, UTF-16LE, UTF-32, etc. They are distinguished largely by the size of their code units. UTF-32 is the simplest encoding: it has a code unit that is 32 bits wide, which means an individual codepoint fits comfortably into a code unit. The other encodings will have situations where a codepoint needs multiple code units, or where a particular codepoint can't be represented in the encoding at all (this is a problem, for instance, with UCS-2).
Because of the flexibility of combining characters, even within a given encoding the number of bytes per character can vary depending on the character and the normalization form. This is a protocol for dealing with characters that have more than one representation (you can say "an 'a' with an accent", which is 2 codepoints, one of which is a combining character, or "accented 'a'", which is one codepoint).
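Here is a short Python sketch of that last point (just an illustration), using the standard unicodedata module to switch between the two representations:

    import unicodedata

    composed   = "\u00E9"    # 'é' as one code point (LATIN SMALL LETTER E WITH ACUTE)
    decomposed = "e\u0301"   # 'e' plus COMBINING ACUTE ACCENT: two code points
    print(len(composed), len(decomposed))                                  # 1 2
    print(len(composed.encode("utf-8")), len(decomposed.encode("utf-8")))  # 2 3
    # Normalization converts between the two forms:
    print(unicodedata.normalize("NFC", decomposed) == composed)            # True
    print(unicodedata.normalize("NFD", composed) == decomposed)            # True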
Upvotes: 185
Reputation: 1647
For UTF-16, a character needs four bytes (two code units) if its code point is U+10000 or above; such a character is encoded as a "surrogate pair." More specifically, a surrogate pair has the form:
[0xD800 - 0xDBFF] [0xDC00 - 0xDFFF]
where [...] indicates a two-byte code unit with the given range. Any code point <= 0xD7FF or >= 0xE000 is a single code unit (two bytes). Code units in the range 0xD800 - 0xDFFF on their own are invalid, since that range is reserved for surrogate pairs.
See http://unicodebook.readthedocs.io/unicode_encodings.html, section 7.5.
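For illustration, here is how the pair can be computed by hand in Python, a sketch of the standard UTF-16 formula (the function name is just for illustration):

    def surrogate_pair(cp):
        # Split a supplementary code point (U+10000..U+10FFFF) into two UTF-16 code units.
        assert 0x10000 <= cp <= 0x10FFFF
        cp -= 0x10000
        high = 0xD800 + (cp >> 10)       # lead surrogate, 0xD800..0xDBFF
        low  = 0xDC00 + (cp & 0x3FF)     # trail surrogate, 0xDC00..0xDFFF
        return high, low

    print([hex(u) for u in surrogate_pair(0x1F680)])   # ['0xd83d', '0xde80']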
Upvotes: 3
Reputation: 9655
There is a great tool for calculating the bytes of any string in UTF-8: http://mothereff.in/byte-counter
Update: @mathias has made the code public: https://github.com/mathiasbynens/mothereff.in/blob/master/byte-counter/eff.js
Upvotes: 8
Reputation: 8240
Simply speaking, Unicode is a standard which assigns one number (called a code point) to every character of the world (it is still a work in progress).
Now you need to represent these code points using bytes; that's called character encoding. UTF-8, UTF-16 and UTF-32 are ways of representing those characters.
UTF-8 is a multibyte character encoding. Characters can take 1 to 4 bytes (the original design allowed up to 6, but Unicode is now capped at U+10FFFF, so 4 is the maximum).
UTF-32: each character takes exactly 4 bytes.
UTF-16 uses 16-bit code units. A single code unit covers only the part of Unicode called the BMP (for all practical purposes it's enough); characters outside the BMP take two code units (4 bytes). Java uses this encoding for its strings.
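A quick way to see the three encodings side by side is to encode one mixed string and compare the byte counts; a small Python sketch (the sample string is just an example):

    text = "Hi \u8089 \U0001F680"     # ASCII, a BMP CJK character, and a supplementary-plane character
    for enc in ("utf-8", "utf-16-le", "utf-32-le"):
        print(enc, len(text.encode(enc)), "bytes")
    # utf-8 11 bytes, utf-16-le 14 bytes, utf-32-le 24 bytes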
Upvotes: 37
Reputation: 1977
Well I just pulled up the Wikipedia page on it too, and in the intro portion I saw "Unicode can be implemented by different character encodings. The most commonly used encodings are UTF-8 (which uses one byte for any ASCII characters, which have the same code values in both UTF-8 and ASCII encoding, and up to four bytes for other characters), the now-obsolete UCS-2 (which uses two bytes for each character but cannot encode every character in the current Unicode standard)"
As this quote demonstrates, your problem is that you are assuming Unicode is a single way of encoding characters. There are actually multiple Unicode encodings, and, as that quote shows, one of them (UTF-8) even uses a single byte per character for the ASCII range, just like the encoding you are used to.
So your simple answer that you want is that it varies.
Upvotes: 5