Reputation: 71
Trying to match the hash character fails, but matching any other member of the list succeeds.
Why does this fail?
Thanks,
Joe
UNIT = [ 'floor', 'fl', '#', 'penthouse', 'mezzanine', 'basement', 'room' ]
unit_regex = "\\b(" + UNIT.to_a.join("|") + ")\\b"
unit_regexp = Regexp.new(unit_regex, Regexp::IGNORECASE)
x=unit_regexp.match('#')
Upvotes: 3
Views: 101
Reputation: 434965
As noted in the comments, your problem is that \b is a word boundary inside a regex (unless it is inside a character class: the \b in /[\b]/ is a backspace, just like in a double-quoted string). A word boundary is roughly a word character on one side and nothing or a non-word character on the other side. But # is not a word character, so /\b/ can't match anywhere in '#' and your whole regex fails to match.
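To see the word-boundary behaviour in isolation, here is a small sketch. The constant names and the two-item pattern are made up for illustration; only the \b semantics come from the explanation above.

```ruby
# Word characters are letters, digits and underscore; '#' is not one of them.
WORD_BOUNDARY = /\b/

# A cut-down version of the pattern from the question (hypothetical name).
ORIGINAL_STYLE = /\b(floor|#)\b/i

# No word character anywhere in the string, so there is no boundary to find.
p WORD_BOUNDARY.match?('#')        # => false

# Here a boundary exists between the space and the "f".
p WORD_BOUNDARY.match?('# floor')  # => true

p ORIGINAL_STYLE.match?('floor')   # => true
p ORIGINAL_STYLE.match?('#')       # => false

# '#' only matches when word characters happen to sit on both sides of it.
p ORIGINAL_STYLE.match?('a#b')     # => true
```

The last line shows that \b around a non-word character asks for the opposite of what you usually want: it requires the # to be glued to letters or digits.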
You're going to have to be more explicit about what you're trying to match. A first stab would be "the beginning of the string or whitespace" instead of the first \b and "the end of the string or whitespace" instead of the second \b. That could be expressed like this:
unit_regex = '(?<=\A|\s)(' + UNIT.to_a.join('|') + ')(?=\z|\s)'
Note that I've switched to single quotes to avoid all the double escaping hassles. The ?<= is a positive lookbehind, which means that (\A|\s) needs to be there but it won't be part of the match; similarly, ?= is a positive lookahead. See the manual for more details. Also note that we're using \A rather than ^ since ^ matches the beginning of a line, not the string; similarly, \z instead of $ because \z matches the end of the string whereas $ matches the end of a line.
You may need to tweak the regex depending on your data but hopefully that will get you started.
Upvotes: 4