What is the best way to match space-like chars?

Question

I thought [[:space:]] matched all space-like characters, but the zero-width spaces are exceptions.

# normal space
32.chr('UTF-8').match?(/[[:space:]]/) #=> true
# no break space
160.chr('UTF-8').match?(/[[:space:]]/) #=> true
# en space 
8194.chr('UTF-8').match?(/[[:space:]]/) #=> true
# em space
8195.chr('UTF-8').match?(/[[:space:]]/) #=> true
# thin space
8201.chr('UTF-8').match?(/[[:space:]]/) #=> true
# ideographic space
12288.chr('UTF-8').match?(/[[:space:]]/) #=> true
# zero width space
8203.chr('UTF-8').match?(/[[:space:]]/) #=> false
# zero width no break space
65279.chr('UTF-8').match?(/[[:space:]]/) #=> false

How can I write a regular expression that matches all of these spaces?

Aleksei Matiushkin · Accepted Answer

Unfortunately, neither zero-width space is classified as whitespace; both are “Other, Format” (Cf) characters.

That matches the Unicode specification: search (Ctrl+F) for 200B and you will find it listed under “Format characters.” Since you want to match ZWSP, I see no reason not to match all the format characters, which can be done with:

/\p{Zs}|\p{Cf}/ =~ 65279.chr('UTF-8')
#⇒ 0
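One thing to watch: \p{Zs} covers only space separators, so the pattern above does not match a tab or a newline. The sketch below checks a combined character class against every code point from the question. The constant names are made up for illustration, and it assumes "space-like" means Unicode whitespace plus format characters.

```ruby
# Code points from the question, plus tab (9) and newline (10) for comparison.
CODE_POINTS = [32, 160, 8194, 8195, 8201, 12288, 8203, 65279, 9, 10].freeze

# Assumption: "space-like" = Unicode whitespace ([[:space:]]) + format characters (Cf).
SPACE_LIKE = /[[:space:]\p{Cf}]/

# Print, for each code point, whether each pattern matches it.
CODE_POINTS.each do |cp|
  char = cp.chr('UTF-8')
  puts format('U+%04X space_like=%s zs_or_cf=%s',
              cp, char.match?(SPACE_LIKE), char.match?(/\p{Zs}|\p{Cf}/))
end

# Stripping every space-like character, including the zero-width space:
"a\u200Bb c".gsub(SPACE_LIKE, '') #=> "abc"
```

Keep in mind that \p{Cf} also includes characters such as the soft hyphen (U+00AD) and the zero-width joiner (U+200D), so removing every match can change more than just spacing.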

Please also note that explicitly enumerating characters is a very bad idea when dealing with Unicode. The specification changes quite often, and an explicit list of characters can become obsolete by tomorrow morning.

There are two general approaches to deal with this:

parse the Unicode Consortium specs (e.g. Elixir does that to ensure proper handling of the latest version of Unicode);
use generic “groups” (e.g. [[:space:]] or \p{Zs}).
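To see which generic group a given character falls into, a small helper along these lines may be useful. It is only a sketch: PROPERTY_CLASSES and matching_classes are hypothetical names, and the three patterns are just the ones discussed in this answer.

```ruby
# Hypothetical lookup table: label => property-based pattern.
PROPERTY_CLASSES = {
  'space' => /[[:space:]]/, # Unicode whitespace (includes tab and newline)
  'Zs'    => /\p{Zs}/,      # space separators only
  'Cf'    => /\p{Cf}/       # format characters, e.g. zero-width space
}.freeze

# Returns the labels of every class whose pattern matches the character.
def matching_classes(char)
  PROPERTY_CLASSES.select { |_label, pattern| char.match?(pattern) }.keys
end

matching_classes(' ')      #=> ["space", "Zs"]
matching_classes("\t")     #=> ["space"]
matching_classes("\u200B") #=> ["Cf"]
```

Because the patterns rely on properties rather than a hand-written list, which characters they cover depends on the Unicode version supported by the Ruby in use.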
