Using General Unicode Properties

Question

I am trying to take advantage of the regex feature \p{UNICODE PROPERTY NAME}.

However, I am struggling to understand the mapping of those property names.

I went directly to the Unicode.org website (http://www.unicode.org/Public/UCD/latest/ucd/) and downloaded the file 'UnicodeData.txt', which lists the category for each character, but it only shows 27,268 character values.

But I understand there are 65k characters in UTF-8 or UCS-2, so I am confused why the Unicode.org download only has about 27k rows.

Am I missing a point here somewhere?

I am sure I'm just being blind to something simple here. If someone can help me understand, I'd be grateful!

Boldewyn · Accepted Answer

Everything is fine so far. The characters you see are all of them except the CJK ones (Chinese, Japanese, Korean). The Unicode Consortium left those out of the main UnicodeData file to keep it at a reasonable size.
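The large blocks are not missing entirely: UnicodeData.txt records each one as a pair of lines whose names end in `, First>` and `, Last>`. Here is a minimal Python sketch of expanding those pairs into ranges. The `SAMPLE` lines are an illustrative excerpt (the end of the CJK range differs between Unicode versions), and `parse_ranges` and `category_of` are hypothetical helper names, not part of any library.

```python
# Illustrative excerpt in the UnicodeData.txt layout (semicolon-separated fields).
# Field 0 is the code point, field 1 the name, field 2 the general category.
SAMPLE = """\
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;
9FFF;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;
"""


def parse_ranges(text):
    """Return a list of (first, last, category) tuples, expanding First/Last pairs."""
    ranges = []
    pending_first = None
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split(";")
        cp = int(fields[0], 16)
        name, category = fields[1], fields[2]
        if name.endswith(", First>"):
            # Remember the start of the range until the matching Last line.
            pending_first = cp
        elif name.endswith(", Last>"):
            ranges.append((pending_first, cp, category))
            pending_first = None
        else:
            ranges.append((cp, cp, category))
    return ranges


def category_of(ranges, cp):
    """Linear lookup of the general category for one code point, or None."""
    for first, last, category in ranges:
        if first <= cp <= last:
            return category
    return None


ranges = parse_ranges(SAMPLE)
covered = sum(last - first + 1 for first, last, _ in ranges)
```

With this sample, three lines of input describe many thousands of code points, which is why the row count of the file is much smaller than the number of assigned characters.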

If you only want to look up properties for single characters (and not in bulk), you can use websites that prepare that data for you, like Graphemica, FileFormat, or (my own) Codepoints.net.
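For a quick local check of a single character, many languages also ship a copy of the Unicode database. As a small sketch, assuming Python is available, the standard `unicodedata` module reports the general category that `\p{...}` expressions refer to (`describe` is a hypothetical helper name):

```python
import unicodedata


def describe(ch):
    """Return code point, name and general category for one character."""
    return {
        "codepoint": f"U+{ord(ch):04X}",
        # name() raises ValueError for unnamed characters unless a default is given.
        "name": unicodedata.name(ch, "<unnamed>"),
        "category": unicodedata.category(ch),
    }


info_a = describe("A")
info_cjk = describe("\u4e2d")
```

Note that the results reflect the Unicode version bundled with the interpreter, which may lag behind the latest files on unicode.org. Python's built-in `re` module does not itself support `\p{...}`; the third-party `regex` package does.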

If, however, you need bulk lookups, Unicode also provides the data as an XML file with a specific syntax that groups codepoints together. That might be the best choice for processing the data.
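A rough sketch of reading such a grouped file in Python follows. The `SAMPLE_XML` string is hand-written and heavily simplified, based on my understanding of the format: `group` elements carry default attributes, `char` elements can override them, and ranges use `first-cp` and `last-cp`. Treat the element and attribute names as assumptions to verify against the real file. `iter_categories` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Namespace assumed for the Unicode XML data; verify against the real file.
NS = "{http://www.unicode.org/ns/2003/ucd/1.0}"

# Hand-written, simplified sample; the real file has many more attributes.
SAMPLE_XML = """\
<ucd xmlns="http://www.unicode.org/ns/2003/ucd/1.0">
  <repertoire>
    <group gc="Lu">
      <char cp="0041" na="LATIN CAPITAL LETTER A"/>
      <char cp="0031" na="DIGIT ONE" gc="Nd"/>
    </group>
    <group gc="Lo">
      <char first-cp="4E00" last-cp="9FFF" na="CJK UNIFIED IDEOGRAPH-#"/>
    </group>
  </repertoire>
</ucd>
"""


def iter_categories(xml_text):
    """Yield (first, last, general_category) for every char entry in a group."""
    root = ET.fromstring(xml_text)
    for group in root.iter(NS + "group"):
        for char in group.findall(NS + "char"):
            # A char inherits the group's value unless it sets its own.
            gc = char.get("gc", group.get("gc"))
            if char.get("cp") is not None:
                first = last = int(char.get("cp"), 16)
            else:
                first = int(char.get("first-cp"), 16)
                last = int(char.get("last-cp"), 16)
            yield first, last, gc


entries = list(iter_categories(SAMPLE_XML))
```

For the full file, which is large, a streaming parser such as `ET.iterparse` would probably be a better fit than loading everything at once.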
