Reputation: 2035
Is there a good resource for finding the last two characters of each plane, particularly planes 3–13?
Obviously 0xFFFE and 0xFFFF are noncharacters, as well as 0x10FFFE and 0x10FFFF, but I can't find a complete list of where the last characters of each plane are, as I can't tell where each plane ends.
On Unicode's website it refers to the last two characters of every plane being noncharacters.
Upvotes: 2
Views: 1965
Reputation: 2603
The official source is http://unicode.org/charts/index.html; search for "Noncharacters in Charts." In fact, as of Unicode 12.1, the noncharacters at the end of Planes 3 to 13 (hex D) are the only designated code points in those planes.
There are exactly 66 noncharacters in Unicode. 34 of them reside at the final two code points of each of the 17 planes, and there is an additional contiguous range of 32 noncharacters from U+FDD0 to U+FDEF in the Arabic Presentation Forms-A block.
Any code point whose hexadecimal value ends with FFFE or FFFF is a noncharacter. In addition, any 4-digit code point beginning with FDD or FDE is a noncharacter.
I'll enumerate the noncharacters:
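The two rules above are enough to generate the full list. Here is a minimal Python sketch; the function names `is_noncharacter` and `all_noncharacters` are my own and do not come from any library:

```python
def is_noncharacter(cp: int) -> bool:
    """Return True if cp is one of the 66 Unicode noncharacter code points."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError(f"not a Unicode code point: {cp:#x}")
    # Rule 1: the last two code points of every plane (low 16 bits FFFE or FFFF).
    if (cp & 0xFFFE) == 0xFFFE:
        return True
    # Rule 2: the contiguous BMP range U+FDD0..U+FDEF.
    return 0xFDD0 <= cp <= 0xFDEF


def all_noncharacters() -> list[int]:
    """Build the list of noncharacters in ascending order."""
    result = list(range(0xFDD0, 0xFDEF + 1))   # 32 code points in the BMP
    for plane in range(17):                    # planes 0 through 16
        base = plane << 16
        result += [base | 0xFFFE, base | 0xFFFF]
    return result


if __name__ == "__main__":
    for cp in all_noncharacters():
        print(f"U+{cp:04X}")
```

The list comes out as U+FDD0 through U+FDEF, followed by U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, and so on up to U+10FFFE, U+10FFFF.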
Upvotes: 2
Reputation:
..., as I can't tell where each plane ends.
Every plane by definition ends at U+xxFFFF.
On Unicode's website it refers to the last two characters of every plane being noncharacters.
No. The Unicode Standard Version 9.0 - Core Specification says (in section 23.7 Noncharacters):
The Unicode Standard sets aside 66 noncharacter code points. The last two code points of each plane are noncharacters: U+FFFE and U+FFFF on the BMP, U+1FFFE and U+1FFFF on Plane 1, and so on, up to U+10FFFE and U+10FFFF on Plane 16, for a total of 34 code points. In addition, there is a contiguous range of another 32 noncharacter code points in the BMP: U+FDD0..U+FDEF. For historical reasons, the range U+FDD0..U+FDEF is contained within the Arabic Presentation Forms-A block, but those noncharacters are not “Arabic noncharacters” or “right-to-left noncharacters,” and are not distinguished in any other way from the other noncharacters, except in their code point values.
Note that the keyword is "code points", not "characters"; they are always U+xxFFFE and U+xxFFFF.
Upvotes: 2
Reputation: 21249
Each Unicode plane contains 2^16 (65,536) code points, with the first plane starting at 0x000000, and the last two code points of each plane are noncharacters. Therefore, all 0x••FFFE and 0x••FFFF code points are noncharacters, where •• is anything from 0x00 through 0x10 (identifying the plane).
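To make the plane arithmetic concrete, here is a small Python sketch. The helper names (`plane_of`, `plane_range`, `plane_noncharacters`) are made up for illustration:

```python
PLANE_SIZE = 1 << 16  # 2^16 = 65,536 code points per plane


def plane_of(cp: int) -> int:
    """Plane number of a code point: everything above the low 16 bits."""
    return cp >> 16


def plane_range(plane: int) -> tuple[int, int]:
    """First and last code point of the given plane (0 through 16)."""
    first = plane * PLANE_SIZE
    return first, first + PLANE_SIZE - 1


def plane_noncharacters(plane: int) -> tuple[int, int]:
    """The two noncharacters that close the given plane."""
    _, last = plane_range(plane)
    return last - 1, last


if __name__ == "__main__":
    for plane in range(3, 14):  # the planes the question asks about
        lo, hi = plane_noncharacters(plane)
        print(f"Plane {plane:2d}: U+{lo:04X}, U+{hi:04X}")
```

For Plane 3 this gives U+3FFFE and U+3FFFF, and for Plane 13 it gives U+DFFFE and U+DFFFF.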
Upvotes: 0
Reputation: 201768
The Unicode Character Database contains authoritative information on the status of each code point. Using it, you can determine the last assigned code point of each plane. This may (actually, will) change over time, as new characters are assigned. You would also need to define what you mean by “character” – in particular, whether you regard Private Use code points as “characters”.
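As a rough illustration, Python's standard `unicodedata` module exposes the General_Category from the version of the Unicode Character Database bundled with the interpreter, so the result depends on your Python version. The sketch below assumes "assigned" means any category other than Cn (unassigned, which also covers noncharacters); the `last_assigned` function and its `count_private_use` flag are hypothetical names of mine:

```python
import unicodedata
from typing import Optional


def last_assigned(plane: int, count_private_use: bool = True) -> Optional[int]:
    """Highest code point in the plane whose category is not excluded.

    Returns None if the plane has no such code point in the bundled UCD.
    """
    excluded = {"Cn"}            # unassigned, including noncharacters
    if not count_private_use:
        excluded.add("Co")       # optionally ignore Private Use code points
    first = plane << 16
    # Scan downward from the end of the plane.
    for cp in range(first + 0xFFFF, first - 1, -1):
        if unicodedata.category(chr(cp)) not in excluded:
            return cp
    return None


if __name__ == "__main__":
    print("UCD version:", unicodedata.unidata_version)
    for plane in range(17):
        cp = last_assigned(plane)
        print(f"Plane {plane:2d}:", "none" if cp is None else f"U+{cp:04X}")
```

With Private Use counted, Planes 15 and 16 end at U+FFFFD and U+10FFFD; with `count_private_use=False` they report no assigned code points at all, which is why the definition of "character" matters here.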
Upvotes: 0