ironsand

Reputation: 15151

What is the best way to match space-like chars?

I thought [[:space:]] matched all space-like characters, but "zero width space" is an exception.

# normal space
32.chr('UTF-8').match?(/[[:space:]]/) #=> true
# no break space
160.chr('UTF-8').match?(/[[:space:]]/) #=> true
# en space 
8194.chr('UTF-8').match?(/[[:space:]]/) #=> true
# em space
8195.chr('UTF-8').match?(/[[:space:]]/) #=> true
# thin space
8201.chr('UTF-8').match?(/[[:space:]]/) #=> true
# ideographic space
12288.chr('UTF-8').match?(/[[:space:]]/) #=> true
# zero width space
8203.chr('UTF-8').match?(/[[:space:]]/) #=> false
# zero width no break space
65279.chr('UTF-8').match?(/[[:space:]]/) #=> false

How can I write a regular expression that matches all these spaces?

Upvotes: 2

Views: 946

Answers (2)

user557597

Reputation:

Per request.

Doing a Unicode 9 query from the UCD database, these properties all show up
for space.

Running the regex returns 28 characters.

If you run just a \s, 29 characters show up.

However, if you combine them, it yields 32 characters,
which I assume is the complete set of whitespace.

[\x{9}-\x{D}\x{1C}-\x{20}\x{85}\x{A0}\x{1680}\x{2000}-\x{200B}\x{200E}-\x{200F}\x{2028}-\x{2029}\x{202F}\x{205F}\x{3000}]

Or

[\s\p{White_Space}\p{Pattern_White_Space}\p{Bidi_Class=White_Space}\p{General_Category=Space_Separator}\p{Line_Break=Space}\p{Line_Break=ZWSpace}]

References:

http://www.regexformat.com/scrn8/UCDusage.htm
http://www.regexformat.com/scrn8/Uusage.jpg

Here is the complete list

000009    <control-0009>
00000A    <control-000A>
00000B    <control-000B>
00000C    <control-000C>
00000D    <control-000D>
00001C    <control-001C>
00001D    <control-001D>
00001E    <control-001E>
00001F    <control-001F>
000020    SPACE
000085    <control-0085>
0000A0    NO-BREAK SPACE
001680    OGHAM SPACE MARK
002000    EN QUAD
002001    EM QUAD
002002    EN SPACE
002003    EM SPACE
002004    THREE-PER-EM SPACE
002005    FOUR-PER-EM SPACE
002006    SIX-PER-EM SPACE
002007    FIGURE SPACE
002008    PUNCTUATION SPACE
002009    THIN SPACE
00200A    HAIR SPACE
00200B    ZERO WIDTH SPACE
00200E    LEFT-TO-RIGHT MARK
00200F    RIGHT-TO-LEFT MARK
002028    LINE SEPARATOR
002029    PARAGRAPH SEPARATOR
00202F    NARROW NO-BREAK SPACE
00205F    MEDIUM MATHEMATICAL SPACE
003000    IDEOGRAPHIC SPACE
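
For Ruby specifically, here is a minimal sketch of the explicit class above. It assumes the \x{...} escapes need to be written as \u escapes in a Ruby regex literal; the constant name EXPLICIT_SPACES is made up for illustration and is not part of any library.

```ruby
# Hypothetical constant: the 32 code points listed above, written with
# Ruby's \uXXXX escapes instead of the \x{...} notation.
EXPLICIT_SPACES = /[\u0009-\u000D\u001C-\u0020\u0085\u00A0\u1680\u2000-\u200B\u200E\u200F\u2028\u2029\u202F\u205F\u3000]/

# Count how many code points up to U+3000 the class matches.
matched = (0..0x3000).count { |cp| cp.chr('UTF-8').match?(EXPLICIT_SPACES) }

# U+FEFF (zero width no break space) from the question is not in the list.
bom_matched = 0xFEFF.chr('UTF-8').match?(EXPLICIT_SPACES)

puts matched
puts bom_matched
```

Note that this list does not cover U+FEFF, so it would have to be added by hand if the question's last case matters.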

Upvotes: 1

Aleksei Matiushkin

Reputation: 121000

Unfortunately, neither zero-width space is considered a blank space; both are “Other:Format” characters.

That matches the specification (Ctrl+F for 200B): the section is titled “Format characters.” Since you want to match ZWSP, I do not see any reason not to match all the format characters, which can be done with:

/\p{Zs}|\p{Cf}/ =~ 65279.chr('UTF-8')
#⇒ 0
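
To check this against every code point from the question, here is a small sketch. It swaps \p{Zs} for [[:space:]] inside one character class so that tabs and newlines are covered too; the names QUESTION_CODE_POINTS and SPACE_OR_FORMAT are hypothetical.

```ruby
# Code points taken from the question, keyed by a descriptive label.
QUESTION_CODE_POINTS = {
  'space'                     => 32,
  'no break space'            => 160,
  'en space'                  => 8194,
  'em space'                  => 8195,
  'thin space'                => 8201,
  'ideographic space'         => 12288,
  'zero width space'          => 8203,
  'zero width no break space' => 65279
}.freeze

# Hypothetical pattern: POSIX space class plus the Format (Cf) category.
SPACE_OR_FORMAT = /[[:space:]\p{Cf}]/

results = QUESTION_CODE_POINTS.transform_values do |code_point|
  code_point.chr('UTF-8').match?(SPACE_OR_FORMAT)
end

# Every value should come out true.
results.each { |label, matched| puts "#{label}: #{matched}" }
```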

Please also note that any explicit enumeration of characters is a very bad idea when dealing with Unicode. The specification changes quite often, and an explicit list of characters will become obsolete, like, tomorrow morning.

There are two general approaches to deal with this:

  • parse consortium specs (e.g. does that to ensure the proper handling of the latest version of Unicode),
  • use generic “groups” (e.g. [[:space:]] or \p{Zs}).
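
As a rough illustration of the second approach, the sketch below splits a string on runs of space-like characters using only generic groups. The pattern name SPACE_LIKE_RUN and the sample text are made up. One caveat: \p{Cf} also covers characters such as ZERO WIDTH JOINER (U+200D), so it may be broader than wanted for some input.

```ruby
# Hypothetical pattern: one or more whitespace or format characters.
SPACE_LIKE_RUN = /[[:space:]\p{Cf}]+/

# Sample text containing a zero width space, an ideographic space,
# and a trailing zero width no break space.
text = "foo\u200Bbar\u3000baz\uFEFF"

words = text.split(SPACE_LIKE_RUN)
puts words.inspect
```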

Upvotes: 5
