Jacob Wegelin

Reputation: 1306

What is the most elegant way to manipulate an arbitrary Unicode character in a Perl regex?

Consider a Unicode character, such as zero-width space, which is not on any conventional keyboard and is not part of any human writing system. Suppose one wants to use Perl to remove this character from a string, or to print the character from a bash prompt on Unix.

This post reviews how one can do these things using hexadecimal code, and then asks: Is there a more direct (or elegant) way to do these things, using perhaps the decimal representation of the character?

The "zero-width space" http://www.unicode-symbol.com/u/200B.html shows up occasionally in text files.

For instance, on a MacBook Pro, from Messages.app, I saved an SMS conversation as a PDF. Then I opened the PDF in Preview, copied all, and pasted the clipboard into a file z. Then less z showed many instances of <U+200B>, and when I opened the file in vim the character showed up as <200b>.

Similarly, "pop directional formatting", http://www.unicode-symbol.com/u/202C.html, shows up when I copy and paste a phone number from the telephone field of Contacts.app.

Often I want to get the plain text from a string---anything that a human being would actually want to read, including letters in any language such as French é, Greek β, Arabic, Chinese and of course tab, space, and newline---without other characters.

This is because the other characters can cause problems. Not only are they a distraction in less and vim, but they also seem to cause LaTeX (pdflatex) to throw an error.

One can remove "zero-width space" as follows:

  1. go to the url for the character, as cited above
  2. scroll down to the table titled "Encodings (Unicode characters converter)"
  3. on the UTF-8 row, find the text "E2 80 8B"
  4. By hand, convert this to \xe2\x80\x8b
  5. perl -p -e 's/\xe2\x80\x8b//g;' myfile

Using the same approach, one can print the character:

printf '\xe2\x80\x8b'

But on the same row in http://www.unicode-symbol.com/u/200B.html where one obtains the triad of hexadecimal numbers, one also finds that the decimal representation is 14844043. Is there a way to use this decimal representation, or some other approach more direct than pasting together three hexadecimal codes?

Upvotes: 1

Views: 488

Answers (2)

choroba

Reputation: 241858

Elegance is in the eye of the beholder.

But the -C switch enables Perl's Unicode handling of input and output, so you can take advantage of that.

perl -CD -wpe 's/\x{200B}//g' file

Also, you can use \N{...} to specify a character by its full name:

perl -CD -wpe 's/\N{ZERO WIDTH SPACE}//g' file

See perlrun for the explanation of the details of -C. In particular, -CD is equivalent to -Cio, which means "make UTF-8 the default PerlIO layer for input and output streams".

Upvotes: 4

ikegami

Reputation: 385744

To remove U+200B ZERO WIDTH SPACE specifically:

perl -CSD -pe's/\x{200B}//g'
perl -CSD -pe's/\N{U+200B}//g'
perl -CSD -pe's/\N{ZERO WIDTH SPACE}//g'

-CSD makes Perl decode and encode STDIN, STDOUT, STDERR, and the files read through ARGV as UTF-8.

See also: "Specifying file to process to Perl one-liner".


That said, it sounds like you want a more general approach that would match "characters like ZERO WIDTH SPACE", not just ZERO WIDTH SPACE. But it's unclear what that means. Here are the properties ZERO WIDTH SPACE has:

$ uniprops -a1 200B
U+200B ‹U+200B› \N{ZERO WIDTH SPACE}
\pC
\p{Cf}
All
Any
Assigned
C
Other
Case_Ignorable
CI
Cf
Format
Changes_When_NFKC_Casefolded
CWKCF
Common
Zyyy
Default_Ignorable_Code_Point
DI
General_Punctuation
InPunctuation
Graph
X_POSIX_Graph
Print
X_POSIX_Print
Unicode
Age=1.1
Age=V1_1
Bidi_Class=BN
Bidi_Class=Boundary_Neutral
BC=BN
Bidi_Paired_Bracket_Type=None
Block=General_Punctuation
BLK=Punctuation
Block=Punctuation
Canonical_Combining_Class=0
Canonical_Combining_Class=Not_Reordered
CCC=NR
Canonical_Combining_Class=NR
Script_Extensions=Common
Decomposition_Type=None
DT=None
East_Asian_Width=Neutral
Grapheme_Cluster_Break=CN
Grapheme_Cluster_Break=Control
GCB=CN
Hangul_Syllable_Type=NA
Hangul_Syllable_Type=Not_Applicable
HST=NA
Identifier_Status=Restricted
Identifier_Type=Default_Ignorable
Indic_Positional_Category=NA
InPC=NA
Indic_Syllabic_Category=Other
InSC=Other
Joining_Group=No_Joining_Group
JG=NoJoiningGroup
Joining_Type=T
Joining_Type=Transparent
JT=T
Line_Break=ZW
Line_Break=ZWSpace
LB=ZW
Numeric_Type=None
NT=None
Numeric_Value=NaN
NV=NaN
Present_In=1.1
IN=1.1
Present_In=2.0
IN=2.0
Present_In=V2_0
Present_In=2.1
IN=2.1
Present_In=V2_1
Present_In=3.0
IN=3.0
Present_In=V3_0
Present_In=3.1
IN=3.1
Present_In=V3_1
Present_In=3.2
IN=3.2
Present_In=V3_2
Present_In=4.0
IN=4.0
Present_In=V4_0
Present_In=4.1
IN=4.1
Present_In=V4_1
Present_In=5.0
IN=5.0
Present_In=V5_0
Present_In=5.1
IN=5.1
Present_In=V5_1
Present_In=5.2
IN=5.2
Present_In=V5_2
Present_In=6.0
IN=6.0
Present_In=V6_0
Present_In=6.1
IN=6.1
Present_In=V6_1
Present_In=6.2
IN=6.2
Present_In=V6_2
Present_In=6.3
IN=6.3
Present_In=V6_3
Present_In=7.0
IN=7.0
Present_In=V7_0
Present_In=8.0
IN=8.0
Present_In=V8_0
Present_In=9.0
IN=9.0
Present_In=V9_0
Present_In=10.0
IN=10.0
Present_In=V10_0
Present_In=11.0
IN=11.0
Present_In=V11_0
Present_In=12.0
IN=12.0
Present_In=V12_0
Present_In=12.1
IN=12.1
Present_In=V12_1
Present_In=13.0
IN=13.0
Present_In=V13_0
Script=Common
SC=Zyyy
Script=Zyyy
Scx=Zyyy
Script_Extensions=Zyyy
Sentence_Break=FO
Sentence_Break=Format
SB=FO
Vertical_Orientation=R
Vertical_Orientation=Rotated
Vo=R
Word_Break=Other
WB=XX
Word_Break=XX

The first two might be the ones of interest to you.


\p{General_Category=Format} aka \p{Gc=Cf} aka \p{Format} aka \p{Cf}

perl -CSD -pe's/\p{Cf}//g'

This property is shared by the following 161 Code Points:

$ unichars -a '\p{Cf}' | cat
 ---- U+000AD SOFT HYPHEN
 ---- U+00600 ARABIC NUMBER SIGN
 ---- U+00601 ARABIC SIGN SANAH
 ---- U+00602 ARABIC FOOTNOTE MARKER
 ---- U+00603 ARABIC SIGN SAFHA
 ---- U+00604 ARABIC SIGN SAMVAT
 ---- U+00605 ARABIC NUMBER MARK ABOVE
 ---- U+0061C ARABIC LETTER MARK
 ---- U+006DD ARABIC END OF AYAH
 ---- U+0070F SYRIAC ABBREVIATION MARK
 ---- U+008E2 ARABIC DISPUTED END OF AYAH
 ---- U+0180E MONGOLIAN VOWEL SEPARATOR
 ---- U+0200B ZERO WIDTH SPACE
 ---- U+0200C ZERO WIDTH NON-JOINER
 ---- U+0200D ZERO WIDTH JOINER
 ---- U+0200E LEFT-TO-RIGHT MARK
 ---- U+0200F RIGHT-TO-LEFT MARK
 ---- U+0202A LEFT-TO-RIGHT EMBEDDING
 ---- U+0202B RIGHT-TO-LEFT EMBEDDING
 ---- U+0202C POP DIRECTIONAL FORMATTING
 ---- U+0202D LEFT-TO-RIGHT OVERRIDE
 ---- U+0202E RIGHT-TO-LEFT OVERRIDE
 ---- U+02060 WORD JOINER
 ---- U+02061 FUNCTION APPLICATION
 ---- U+02062 INVISIBLE TIMES
 ---- U+02063 INVISIBLE SEPARATOR
 ---- U+02064 INVISIBLE PLUS
 ---- U+02066 LEFT-TO-RIGHT ISOLATE
 ---- U+02067 RIGHT-TO-LEFT ISOLATE
 ---- U+02068 FIRST STRONG ISOLATE
 ---- U+02069 POP DIRECTIONAL ISOLATE
 ---- U+0206A INHIBIT SYMMETRIC SWAPPING
 ---- U+0206B ACTIVATE SYMMETRIC SWAPPING
 ---- U+0206C INHIBIT ARABIC FORM SHAPING
 ---- U+0206D ACTIVATE ARABIC FORM SHAPING
 ---- U+0206E NATIONAL DIGIT SHAPES
 ---- U+0206F NOMINAL DIGIT SHAPES
 ---- U+0FEFF ZERO WIDTH NO-BREAK SPACE
 ---- U+0FFF9 INTERLINEAR ANNOTATION ANCHOR
 ---- U+0FFFA INTERLINEAR ANNOTATION SEPARATOR
 ---- U+0FFFB INTERLINEAR ANNOTATION TERMINATOR
 ---- U+110BD KAITHI NUMBER SIGN
 ---- U+110CD KAITHI NUMBER SIGN ABOVE
 ---- U+13430 EGYPTIAN HIEROGLYPH VERTICAL JOINER
 ---- U+13431 EGYPTIAN HIEROGLYPH HORIZONTAL JOINER
 ---- U+13432 EGYPTIAN HIEROGLYPH INSERT AT TOP START
 ---- U+13433 EGYPTIAN HIEROGLYPH INSERT AT BOTTOM START
 ---- U+13434 EGYPTIAN HIEROGLYPH INSERT AT TOP END
 ---- U+13435 EGYPTIAN HIEROGLYPH INSERT AT BOTTOM END
 ---- U+13436 EGYPTIAN HIEROGLYPH OVERLAY MIDDLE
 ---- U+13437 EGYPTIAN HIEROGLYPH BEGIN SEGMENT
 ---- U+13438 EGYPTIAN HIEROGLYPH END SEGMENT
 ---- U+1BCA0 SHORTHAND FORMAT LETTER OVERLAP
 ---- U+1BCA1 SHORTHAND FORMAT CONTINUING OVERLAP
 ---- U+1BCA2 SHORTHAND FORMAT DOWN STEP
 ---- U+1BCA3 SHORTHAND FORMAT UP STEP
 ---- U+1D173 MUSICAL SYMBOL BEGIN BEAM
 ---- U+1D174 MUSICAL SYMBOL END BEAM
 ---- U+1D175 MUSICAL SYMBOL BEGIN TIE
 ---- U+1D176 MUSICAL SYMBOL END TIE
 ---- U+1D177 MUSICAL SYMBOL BEGIN SLUR
 ---- U+1D178 MUSICAL SYMBOL END SLUR
 ---- U+1D179 MUSICAL SYMBOL BEGIN PHRASE
 ---- U+1D17A MUSICAL SYMBOL END PHRASE
 ---- U+E0001 LANGUAGE TAG
 ---- U+E0020 TAG SPACE
 ---- U+E0021 TAG EXCLAMATION MARK
 ---- U+E0022 TAG QUOTATION MARK
 ---- U+E0023 TAG NUMBER SIGN
 ---- U+E0024 TAG DOLLAR SIGN
 ---- U+E0025 TAG PERCENT SIGN
 ---- U+E0026 TAG AMPERSAND
 ---- U+E0027 TAG APOSTROPHE
 ---- U+E0028 TAG LEFT PARENTHESIS
 ---- U+E0029 TAG RIGHT PARENTHESIS
 ---- U+E002A TAG ASTERISK
 ---- U+E002B TAG PLUS SIGN
 ---- U+E002C TAG COMMA
 ---- U+E002D TAG HYPHEN-MINUS
 ---- U+E002E TAG FULL STOP
 ---- U+E002F TAG SOLIDUS
 ---- U+E0030 TAG DIGIT ZERO
 ---- U+E0031 TAG DIGIT ONE
 ---- U+E0032 TAG DIGIT TWO
 ---- U+E0033 TAG DIGIT THREE
 ---- U+E0034 TAG DIGIT FOUR
 ---- U+E0035 TAG DIGIT FIVE
 ---- U+E0036 TAG DIGIT SIX
 ---- U+E0037 TAG DIGIT SEVEN
 ---- U+E0038 TAG DIGIT EIGHT
 ---- U+E0039 TAG DIGIT NINE
 ---- U+E003A TAG COLON
 ---- U+E003B TAG SEMICOLON
 ---- U+E003C TAG LESS-THAN SIGN
 ---- U+E003D TAG EQUALS SIGN
 ---- U+E003E TAG GREATER-THAN SIGN
 ---- U+E003F TAG QUESTION MARK
 ---- U+E0040 TAG COMMERCIAL AT
 ---- U+E0041 TAG LATIN CAPITAL LETTER A
 ---- U+E0042 TAG LATIN CAPITAL LETTER B
 ---- U+E0043 TAG LATIN CAPITAL LETTER C
 ---- U+E0044 TAG LATIN CAPITAL LETTER D
 ---- U+E0045 TAG LATIN CAPITAL LETTER E
 ---- U+E0046 TAG LATIN CAPITAL LETTER F
 ---- U+E0047 TAG LATIN CAPITAL LETTER G
 ---- U+E0048 TAG LATIN CAPITAL LETTER H
 ---- U+E0049 TAG LATIN CAPITAL LETTER I
 ---- U+E004A TAG LATIN CAPITAL LETTER J
 ---- U+E004B TAG LATIN CAPITAL LETTER K
 ---- U+E004C TAG LATIN CAPITAL LETTER L
 ---- U+E004D TAG LATIN CAPITAL LETTER M
 ---- U+E004E TAG LATIN CAPITAL LETTER N
 ---- U+E004F TAG LATIN CAPITAL LETTER O
 ---- U+E0050 TAG LATIN CAPITAL LETTER P
 ---- U+E0051 TAG LATIN CAPITAL LETTER Q
 ---- U+E0052 TAG LATIN CAPITAL LETTER R
 ---- U+E0053 TAG LATIN CAPITAL LETTER S
 ---- U+E0054 TAG LATIN CAPITAL LETTER T
 ---- U+E0055 TAG LATIN CAPITAL LETTER U
 ---- U+E0056 TAG LATIN CAPITAL LETTER V
 ---- U+E0057 TAG LATIN CAPITAL LETTER W
 ---- U+E0058 TAG LATIN CAPITAL LETTER X
 ---- U+E0059 TAG LATIN CAPITAL LETTER Y
 ---- U+E005A TAG LATIN CAPITAL LETTER Z
 ---- U+E005B TAG LEFT SQUARE BRACKET
 ---- U+E005C TAG REVERSE SOLIDUS
 ---- U+E005D TAG RIGHT SQUARE BRACKET
 ---- U+E005E TAG CIRCUMFLEX ACCENT
 ---- U+E005F TAG LOW LINE
 ---- U+E0060 TAG GRAVE ACCENT
 ---- U+E0061 TAG LATIN SMALL LETTER A
 ---- U+E0062 TAG LATIN SMALL LETTER B
 ---- U+E0063 TAG LATIN SMALL LETTER C
 ---- U+E0064 TAG LATIN SMALL LETTER D
 ---- U+E0065 TAG LATIN SMALL LETTER E
 ---- U+E0066 TAG LATIN SMALL LETTER F
 ---- U+E0067 TAG LATIN SMALL LETTER G
 ---- U+E0068 TAG LATIN SMALL LETTER H
 ---- U+E0069 TAG LATIN SMALL LETTER I
 ---- U+E006A TAG LATIN SMALL LETTER J
 ---- U+E006B TAG LATIN SMALL LETTER K
 ---- U+E006C TAG LATIN SMALL LETTER L
 ---- U+E006D TAG LATIN SMALL LETTER M
 ---- U+E006E TAG LATIN SMALL LETTER N
 ---- U+E006F TAG LATIN SMALL LETTER O
 ---- U+E0070 TAG LATIN SMALL LETTER P
 ---- U+E0071 TAG LATIN SMALL LETTER Q
 ---- U+E0072 TAG LATIN SMALL LETTER R
 ---- U+E0073 TAG LATIN SMALL LETTER S
 ---- U+E0074 TAG LATIN SMALL LETTER T
 ---- U+E0075 TAG LATIN SMALL LETTER U
 ---- U+E0076 TAG LATIN SMALL LETTER V
 ---- U+E0077 TAG LATIN SMALL LETTER W
 ---- U+E0078 TAG LATIN SMALL LETTER X
 ---- U+E0079 TAG LATIN SMALL LETTER Y
 ---- U+E007A TAG LATIN SMALL LETTER Z
 ---- U+E007B TAG LEFT CURLY BRACKET
 ---- U+E007C TAG VERTICAL LINE
 ---- U+E007D TAG RIGHT CURLY BRACKET
 ---- U+E007E TAG TILDE
 ---- U+E007F CANCEL TAG

\p{General_Category=Other} aka \p{Gc=C} aka \p{Other} aka \p{C} aka \pC

perl -CSD -pe's/\pC//g'

\p{General_Category=Other} (\pC) includes:

  • \p{General_Category=Control} (\p{Cc}): 65 Code Points
  • \p{General_Category=Format} (\p{Cf}): 161 Code Points [Mentioned above]
  • \p{General_Category=Private_Use} (\p{Co}): 137,468 Code Points
  • \p{General_Category=Unassigned} (\p{Cn}): 830,672 Code Points
  • \p{General_Category=Surrogate} (\p{Cs}): 2,048 Code Points

Of those 970,414, the following are the 226 named ones (equivalent to [\p{Cc}\p{Cf}]):

$ unichars -a '\pC' | cat
 ---- U+00000 NULL
 ---- U+00001 START OF HEADING
 ---- U+00002 START OF TEXT
 ---- U+00003 END OF TEXT
 ---- U+00004 END OF TRANSMISSION
 ---- U+00005 ENQUIRY
 ---- U+00006 ACKNOWLEDGE
 ---- U+00007 ALERT
 ---- U+00008 BACKSPACE
 ---- U+00009 CHARACTER TABULATION
 ---- U+0000A LINE FEED
 ---- U+0000B LINE TABULATION
 ---- U+0000C FORM FEED
 ---- U+0000D CARRIAGE RETURN
 ---- U+0000E SHIFT OUT
 ---- U+0000F SHIFT IN
 ---- U+00010 DATA LINK ESCAPE
 ---- U+00011 DEVICE CONTROL ONE
 ---- U+00012 DEVICE CONTROL TWO
 ---- U+00013 DEVICE CONTROL THREE
 ---- U+00014 DEVICE CONTROL FOUR
 ---- U+00015 NEGATIVE ACKNOWLEDGE
 ---- U+00016 SYNCHRONOUS IDLE
 ---- U+00017 END OF TRANSMISSION BLOCK
 ---- U+00018 CANCEL
 ---- U+00019 END OF MEDIUM
 ---- U+0001A SUBSTITUTE
 ---- U+0001B ESCAPE
 ---- U+0001C INFORMATION SEPARATOR FOUR
 ---- U+0001D INFORMATION SEPARATOR THREE
 ---- U+0001E INFORMATION SEPARATOR TWO
 ---- U+0001F INFORMATION SEPARATOR ONE
 ---- U+0007F DELETE
 ---- U+00080 PADDING CHARACTER
 ---- U+00081 HIGH OCTET PRESET
 ---- U+00082 BREAK PERMITTED HERE
 ---- U+00083 NO BREAK HERE
 ---- U+00084 INDEX
 ---- U+00085 NEXT LINE
 ---- U+00086 START OF SELECTED AREA
 ---- U+00087 END OF SELECTED AREA
 ---- U+00088 CHARACTER TABULATION SET
 ---- U+00089 CHARACTER TABULATION WITH JUSTIFICATION
 ---- U+0008A LINE TABULATION SET
 ---- U+0008B PARTIAL LINE FORWARD
 ---- U+0008C PARTIAL LINE BACKWARD
 ---- U+0008D REVERSE LINE FEED
 ---- U+0008E SINGLE SHIFT TWO
 ---- U+0008F SINGLE SHIFT THREE
 ---- U+00090 DEVICE CONTROL STRING
 ---- U+00091 PRIVATE USE ONE
 ---- U+00092 PRIVATE USE TWO
 ---- U+00093 SET TRANSMIT STATE
 ---- U+00094 CANCEL CHARACTER
 ---- U+00095 MESSAGE WAITING
 ---- U+00096 START OF GUARDED AREA
 ---- U+00097 END OF GUARDED AREA
 ---- U+00098 START OF STRING
 ---- U+00099 SINGLE GRAPHIC CHARACTER INTRODUCER
 ---- U+0009A SINGLE CHARACTER INTRODUCER
 ---- U+0009B CONTROL SEQUENCE INTRODUCER
 ---- U+0009C STRING TERMINATOR
 ---- U+0009D OPERATING SYSTEM COMMAND
 ---- U+0009E PRIVACY MESSAGE
 ---- U+0009F APPLICATION PROGRAM COMMAND
 ---- U+000AD SOFT HYPHEN
 ---- U+00600 ARABIC NUMBER SIGN
 ---- U+00601 ARABIC SIGN SANAH
 ---- U+00602 ARABIC FOOTNOTE MARKER
 ---- U+00603 ARABIC SIGN SAFHA
 ---- U+00604 ARABIC SIGN SAMVAT
 ---- U+00605 ARABIC NUMBER MARK ABOVE
 ---- U+0061C ARABIC LETTER MARK
 ---- U+006DD ARABIC END OF AYAH
 ---- U+0070F SYRIAC ABBREVIATION MARK
 ---- U+008E2 ARABIC DISPUTED END OF AYAH
 ---- U+0180E MONGOLIAN VOWEL SEPARATOR
 ---- U+0200B ZERO WIDTH SPACE
 ---- U+0200C ZERO WIDTH NON-JOINER
 ---- U+0200D ZERO WIDTH JOINER
 ---- U+0200E LEFT-TO-RIGHT MARK
 ---- U+0200F RIGHT-TO-LEFT MARK
 ---- U+0202A LEFT-TO-RIGHT EMBEDDING
 ---- U+0202B RIGHT-TO-LEFT EMBEDDING
 ---- U+0202C POP DIRECTIONAL FORMATTING
 ---- U+0202D LEFT-TO-RIGHT OVERRIDE
 ---- U+0202E RIGHT-TO-LEFT OVERRIDE
 ---- U+02060 WORD JOINER
 ---- U+02061 FUNCTION APPLICATION
 ---- U+02062 INVISIBLE TIMES
 ---- U+02063 INVISIBLE SEPARATOR
 ---- U+02064 INVISIBLE PLUS
 ---- U+02066 LEFT-TO-RIGHT ISOLATE
 ---- U+02067 RIGHT-TO-LEFT ISOLATE
 ---- U+02068 FIRST STRONG ISOLATE
 ---- U+02069 POP DIRECTIONAL ISOLATE
 ---- U+0206A INHIBIT SYMMETRIC SWAPPING
 ---- U+0206B ACTIVATE SYMMETRIC SWAPPING
 ---- U+0206C INHIBIT ARABIC FORM SHAPING
 ---- U+0206D ACTIVATE ARABIC FORM SHAPING
 ---- U+0206E NATIONAL DIGIT SHAPES
 ---- U+0206F NOMINAL DIGIT SHAPES
 ---- U+0FEFF ZERO WIDTH NO-BREAK SPACE
 ---- U+0FFF9 INTERLINEAR ANNOTATION ANCHOR
 ---- U+0FFFA INTERLINEAR ANNOTATION SEPARATOR
 ---- U+0FFFB INTERLINEAR ANNOTATION TERMINATOR
 ---- U+110BD KAITHI NUMBER SIGN
 ---- U+110CD KAITHI NUMBER SIGN ABOVE
 ---- U+13430 EGYPTIAN HIEROGLYPH VERTICAL JOINER
 ---- U+13431 EGYPTIAN HIEROGLYPH HORIZONTAL JOINER
 ---- U+13432 EGYPTIAN HIEROGLYPH INSERT AT TOP START
 ---- U+13433 EGYPTIAN HIEROGLYPH INSERT AT BOTTOM START
 ---- U+13434 EGYPTIAN HIEROGLYPH INSERT AT TOP END
 ---- U+13435 EGYPTIAN HIEROGLYPH INSERT AT BOTTOM END
 ---- U+13436 EGYPTIAN HIEROGLYPH OVERLAY MIDDLE
 ---- U+13437 EGYPTIAN HIEROGLYPH BEGIN SEGMENT
 ---- U+13438 EGYPTIAN HIEROGLYPH END SEGMENT
 ---- U+1BCA0 SHORTHAND FORMAT LETTER OVERLAP
 ---- U+1BCA1 SHORTHAND FORMAT CONTINUING OVERLAP
 ---- U+1BCA2 SHORTHAND FORMAT DOWN STEP
 ---- U+1BCA3 SHORTHAND FORMAT UP STEP
 ---- U+1D173 MUSICAL SYMBOL BEGIN BEAM
 ---- U+1D174 MUSICAL SYMBOL END BEAM
 ---- U+1D175 MUSICAL SYMBOL BEGIN TIE
 ---- U+1D176 MUSICAL SYMBOL END TIE
 ---- U+1D177 MUSICAL SYMBOL BEGIN SLUR
 ---- U+1D178 MUSICAL SYMBOL END SLUR
 ---- U+1D179 MUSICAL SYMBOL BEGIN PHRASE
 ---- U+1D17A MUSICAL SYMBOL END PHRASE
 ---- U+E0001 LANGUAGE TAG
 ---- U+E0020 TAG SPACE
 ---- U+E0021 TAG EXCLAMATION MARK
 ---- U+E0022 TAG QUOTATION MARK
 ---- U+E0023 TAG NUMBER SIGN
 ---- U+E0024 TAG DOLLAR SIGN
 ---- U+E0025 TAG PERCENT SIGN
 ---- U+E0026 TAG AMPERSAND
 ---- U+E0027 TAG APOSTROPHE
 ---- U+E0028 TAG LEFT PARENTHESIS
 ---- U+E0029 TAG RIGHT PARENTHESIS
 ---- U+E002A TAG ASTERISK
 ---- U+E002B TAG PLUS SIGN
 ---- U+E002C TAG COMMA
 ---- U+E002D TAG HYPHEN-MINUS
 ---- U+E002E TAG FULL STOP
 ---- U+E002F TAG SOLIDUS
 ---- U+E0030 TAG DIGIT ZERO
 ---- U+E0031 TAG DIGIT ONE
 ---- U+E0032 TAG DIGIT TWO
 ---- U+E0033 TAG DIGIT THREE
 ---- U+E0034 TAG DIGIT FOUR
 ---- U+E0035 TAG DIGIT FIVE
 ---- U+E0036 TAG DIGIT SIX
 ---- U+E0037 TAG DIGIT SEVEN
 ---- U+E0038 TAG DIGIT EIGHT
 ---- U+E0039 TAG DIGIT NINE
 ---- U+E003A TAG COLON
 ---- U+E003B TAG SEMICOLON
 ---- U+E003C TAG LESS-THAN SIGN
 ---- U+E003D TAG EQUALS SIGN
 ---- U+E003E TAG GREATER-THAN SIGN
 ---- U+E003F TAG QUESTION MARK
 ---- U+E0040 TAG COMMERCIAL AT
 ---- U+E0041 TAG LATIN CAPITAL LETTER A
 ---- U+E0042 TAG LATIN CAPITAL LETTER B
 ---- U+E0043 TAG LATIN CAPITAL LETTER C
 ---- U+E0044 TAG LATIN CAPITAL LETTER D
 ---- U+E0045 TAG LATIN CAPITAL LETTER E
 ---- U+E0046 TAG LATIN CAPITAL LETTER F
 ---- U+E0047 TAG LATIN CAPITAL LETTER G
 ---- U+E0048 TAG LATIN CAPITAL LETTER H
 ---- U+E0049 TAG LATIN CAPITAL LETTER I
 ---- U+E004A TAG LATIN CAPITAL LETTER J
 ---- U+E004B TAG LATIN CAPITAL LETTER K
 ---- U+E004C TAG LATIN CAPITAL LETTER L
 ---- U+E004D TAG LATIN CAPITAL LETTER M
 ---- U+E004E TAG LATIN CAPITAL LETTER N
 ---- U+E004F TAG LATIN CAPITAL LETTER O
 ---- U+E0050 TAG LATIN CAPITAL LETTER P
 ---- U+E0051 TAG LATIN CAPITAL LETTER Q
 ---- U+E0052 TAG LATIN CAPITAL LETTER R
 ---- U+E0053 TAG LATIN CAPITAL LETTER S
 ---- U+E0054 TAG LATIN CAPITAL LETTER T
 ---- U+E0055 TAG LATIN CAPITAL LETTER U
 ---- U+E0056 TAG LATIN CAPITAL LETTER V
 ---- U+E0057 TAG LATIN CAPITAL LETTER W
 ---- U+E0058 TAG LATIN CAPITAL LETTER X
 ---- U+E0059 TAG LATIN CAPITAL LETTER Y
 ---- U+E005A TAG LATIN CAPITAL LETTER Z
 ---- U+E005B TAG LEFT SQUARE BRACKET
 ---- U+E005C TAG REVERSE SOLIDUS
 ---- U+E005D TAG RIGHT SQUARE BRACKET
 ---- U+E005E TAG CIRCUMFLEX ACCENT
 ---- U+E005F TAG LOW LINE
 ---- U+E0060 TAG GRAVE ACCENT
 ---- U+E0061 TAG LATIN SMALL LETTER A
 ---- U+E0062 TAG LATIN SMALL LETTER B
 ---- U+E0063 TAG LATIN SMALL LETTER C
 ---- U+E0064 TAG LATIN SMALL LETTER D
 ---- U+E0065 TAG LATIN SMALL LETTER E
 ---- U+E0066 TAG LATIN SMALL LETTER F
 ---- U+E0067 TAG LATIN SMALL LETTER G
 ---- U+E0068 TAG LATIN SMALL LETTER H
 ---- U+E0069 TAG LATIN SMALL LETTER I
 ---- U+E006A TAG LATIN SMALL LETTER J
 ---- U+E006B TAG LATIN SMALL LETTER K
 ---- U+E006C TAG LATIN SMALL LETTER L
 ---- U+E006D TAG LATIN SMALL LETTER M
 ---- U+E006E TAG LATIN SMALL LETTER N
 ---- U+E006F TAG LATIN SMALL LETTER O
 ---- U+E0070 TAG LATIN SMALL LETTER P
 ---- U+E0071 TAG LATIN SMALL LETTER Q
 ---- U+E0072 TAG LATIN SMALL LETTER R
 ---- U+E0073 TAG LATIN SMALL LETTER S
 ---- U+E0074 TAG LATIN SMALL LETTER T
 ---- U+E0075 TAG LATIN SMALL LETTER U
 ---- U+E0076 TAG LATIN SMALL LETTER V
 ---- U+E0077 TAG LATIN SMALL LETTER W
 ---- U+E0078 TAG LATIN SMALL LETTER X
 ---- U+E0079 TAG LATIN SMALL LETTER Y
 ---- U+E007A TAG LATIN SMALL LETTER Z
 ---- U+E007B TAG LEFT CURLY BRACKET
 ---- U+E007C TAG VERTICAL LINE
 ---- U+E007D TAG RIGHT CURLY BRACKET
 ---- U+E007E TAG TILDE
 ---- U+E007F CANCEL TAG

Upvotes: 4
