Reputation: 15151
I thought [[:space:]] matches all space-like characters, but "zero width space" is an exception.
# normal space
32.chr('UTF-8').match?(/[[:space:]]/) #=> true
# no break space
160.chr('UTF-8').match?(/[[:space:]]/) #=> true
# en space
8194.chr('UTF-8').match?(/[[:space:]]/) #=> true
# em space
8195.chr('UTF-8').match?(/[[:space:]]/) #=> true
# thin space
8201.chr('UTF-8').match?(/[[:space:]]/) #=> true
# ideographic space
12288.chr('UTF-8').match?(/[[:space:]]/) #=> true
# zero width space
8203.chr('UTF-8').match?(/[[:space:]]/) #=> false
# zero width no break space
65279.chr('UTF-8').match?(/[[:space:]]/) #=> false
How can I write a regular expression that matches all of these spaces?
Upvotes: 2
Views: 946
Reputation:
Per request.
Doing a Unicode 9 query from the UCD database, these properties all show up
for space.
Running the regex returns 28 characters.
If you run just a \s, 29 characters show up.
However, if you combine them, it yields 32 characters, which I assume is the complete set of whitespace.
[\x{9}-\x{D}\x{1C}-\x{20}\x{85}\x{A0}\x{1680}\x{2000}-\x{200B}\x{200E}-\x{200F}\x{2028}-\x{2029}\x{202F}\x{205F}\x{3000}]
Or
[\s\p{White_Space}\p{Pattern_White_Space}\p{Bidi_Class=White_Space}\p{General_Category=Space_Separator}\p{Line_Break=Space}\p{Line_Break=ZWSpace}]
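The explicit class above is written with `\x{...}` escapes, which look like PCRE-style syntax. As a rough sketch, the same code points can be spelled with `\u` escapes in Ruby and checked against the code points from the question. The constant name `EXPLICIT_SPACES` and the `matches` hash are made up for this illustration. One thing the check makes visible is that the list contains U+200B but not U+FEFF, so the question's last test case is still not covered by it.

```ruby
# Ruby spelling of the explicit class from this answer, using \u escapes.
# EXPLICIT_SPACES is a hypothetical name chosen for this sketch.
EXPLICIT_SPACES = /[\u0009-\u000D\u001C-\u0020\u0085\u00A0\u1680\u2000-\u200B\u200E\u200F\u2028\u2029\u202F\u205F\u3000]/

# Code points taken from the question.
question_code_points = [32, 160, 8194, 8195, 8201, 12288, 8203, 65279]

# Map each code point (formatted as U+XXXX) to whether the class matches it.
matches = question_code_points.map { |cp|
  [format('U+%04X', cp), cp.chr('UTF-8').match?(EXPLICIT_SPACES)]
}.to_h

matches.each { |name, hit| puts "#{name}: #{hit}" }
```

Only U+FEFF (zero width no-break space) should come back unmatched here, because it is absent from the list.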
References:
http://www.regexformat.com/scrn8/UCDusage.htm
http://www.regexformat.com/scrn8/Uusage.jpg
Here is the complete list
000009 <control-0009>
00000A <control-000A>
00000B <control-000B>
00000C <control-000C>
00000D <control-000D>
00001C <control-001C>
00001D <control-001D>
00001E <control-001E>
00001F <control-001F>
000020 SPACE
000085 <control-0085>
0000A0 NO-BREAK SPACE
001680 OGHAM SPACE MARK
002000 EN QUAD
002001 EM QUAD
002002 EN SPACE
002003 EM SPACE
002004 THREE-PER-EM SPACE
002005 FOUR-PER-EM SPACE
002006 SIX-PER-EM SPACE
002007 FIGURE SPACE
002008 PUNCTUATION SPACE
002009 THIN SPACE
00200A HAIR SPACE
00200B ZERO WIDTH SPACE
00200E LEFT-TO-RIGHT MARK
00200F RIGHT-TO-LEFT MARK
002028 LINE SEPARATOR
002029 PARAGRAPH SEPARATOR
00202F NARROW NO-BREAK SPACE
00205F MEDIUM MATHEMATICAL SPACE
003000 IDEOGRAPHIC SPACE
Upvotes: 1
Reputation: 121000
Unfortunately, neither of the zero-width spaces is considered a blank space; both are “Other: Format” characters.
That corresponds to the specification: press Ctrl+F and search for 200B, and you will find it under the heading “Format characters.” Since you want to match ZWSP, I do not see any reason not to match all the format characters, which can be done with:
/\p{Zs}|\p{Cf}/ =~ 65279.chr('UTF-8')
#⇒ 0
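As a minimal sketch of how this could be applied to the question's test cases, the snippet below folds `[[:space:]]` and `\p{Cf}` into one character class and runs it over the eight code points from the question. The names `SPACE_OR_FORMAT`, `CODE_POINTS`, and `space_like?` are hypothetical and exist only for this example. Keep in mind that `\p{Cf}` covers every format character, so it also matches things like the soft hyphen (U+00AD), which may or may not be what you want.

```ruby
# Code points taken from the question.
CODE_POINTS = {
  'space'                     => 32,
  'no break space'            => 160,
  'en space'                  => 8194,
  'em space'                  => 8195,
  'thin space'                => 8201,
  'ideographic space'         => 12288,
  'zero width space'          => 8203,
  'zero width no break space' => 65279
}.freeze

# Hypothetical combined pattern: POSIX space class plus all
# "Other: Format" (Cf) characters.
SPACE_OR_FORMAT = /[[:space:]\p{Cf}]/

# Returns true when the given code point matches the combined pattern.
def space_like?(code_point)
  code_point.chr('UTF-8').match?(SPACE_OR_FORMAT)
end

results = CODE_POINTS.transform_values { |cp| space_like?(cp) }
results.each { |name, hit| puts "#{name}: #{hit}" }
```

If matching every format character is too broad, the class could be narrowed to `[[:space:]\u200B\uFEFF]`, at the cost of going back to an explicit list.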
Please also note that any explicit enumeration of characters is a very bad idea when dealing with Unicode. The specification changes quite often, and an explicit list of characters could become obsolete as soon as tomorrow morning.
There are two general approaches to deal with this: [[:space:]] or \p{Zs}.
Upvotes: 5