Joe McGuckin

Reputation: 71

Why doesn't this regex work with Ruby

Trying to match the hash character fails, but matching succeeds for any other member of the regex.

Why does this fail?

Thanks,

Joe

UNIT = [ 'floor', 'fl', '#', 'penthouse', 'mezzanine', 'basement', 'room' ]

unit_regex = "\\b(" + UNIT.to_a.join("|") + ")\\b"

unit_regexp = Regexp.new(unit_regex, Regexp::IGNORECASE)

x=unit_regexp.match('#')

Upvotes: 3

Views: 101

Answers (1)

mu is too short

Reputation: 434965

As noted in the comments, your problem is that \b is a word boundary inside a regex (unless it is inside a character class, sigh, the \b in /[\b]/ is a backspace just like in a double quoted string). A word boundary is roughly

a word character on one side and nothing or a non-word character on the other side

But # is not a word character, so a lone '#' has no word boundary on either side of it; the \b before and after it can't match, and your whole regex fails to match.
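A quick way to see this in irb is sketched below. The sample strings ('a#1', '2nd Floor') and the variable names are made up for illustration and are not from the question.

```ruby
# Illustrative sketch; the sample strings are assumptions, not data from the question.
word_boundary = /\b#\b/

# A lone '#' has no word character next to it, so neither \b can match.
no_word_chars = word_boundary.match('#')

# With word characters on both sides, both \b assertions succeed.
between_words = word_boundary.match('a#1')

# The alphabetic alternatives begin and end with word characters, so \b works for them.
floor_match = /\b(floor|fl)\b/i.match('2nd Floor')

p no_word_chars
p between_words
p floor_match
```

This is why every other member of UNIT matched while '#' did not.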

You're going to have to be more explicit about what you're trying to match. A first stab would be "the beginning of the string or whitespace" instead of the first \b and "the end of the string or whitespace" instead of the second \b. That could be expressed like this:

unit_regex = '(?<=\A|\s)(' + UNIT.to_a.join('|') + ')(?=\z|\s)'

Note that I've switched to single quotes to avoid all the double escaping hassles. The ?<= is a positive lookbehind: it means that \A or \s needs to be there but it won't be included in the match; similarly, ?= is a positive lookahead. See the manual for more details. Also note that we're using \A rather than ^ since ^ matches the beginning of a line, not the beginning of the string; similarly, \z rather than $ because \z matches the end of the string whereas $ matches the end of a line.
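The anchor and quoting differences can be checked with a small sketch like this one. The two-line sample string and the variable names are assumptions chosen for illustration.

```ruby
# Illustrative sketch; the two-line string is a made-up example.
multi_line = "Apt 4\n# 12"

# ^ matches at the start of any line, so it finds the '#' after the newline.
line_anchor = /^#/.match(multi_line)

# \A matches only at the very start of the string, where 'A' sits, so this is nil.
string_anchor = /\A#/.match(multi_line)

# $ matches at the end of a line; \z matches only at the end of the string.
line_end = /4$/.match(multi_line)
string_end = /4\z/.match(multi_line)

# Single quotes leave the backslash alone, so '\A' equals the double-quoted "\\A".
same_source = ('\A' == "\\A")

p line_anchor, string_anchor, line_end, string_end, same_source
```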

You may need to tweak the regex depending on your data but hopefully that will get you started.
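Putting it together, a minimal sketch of the lookaround version might look like the following. The Regexp.escape call is my addition as a precaution for units that contain regex metacharacters, and the test strings are invented for illustration.

```ruby
UNIT = ['floor', 'fl', '#', 'penthouse', 'mezzanine', 'basement', 'room']

# Regexp.escape is an addition here (not in the original expression); it guards
# against units that contain regex metacharacters.
alternatives = UNIT.map { |u| Regexp.escape(u) }.join('|')
unit_regexp = Regexp.new('(?<=\A|\s)(' + alternatives + ')(?=\z|\s)', Regexp::IGNORECASE)

# The sample inputs below are made up for illustration.
hash_alone  = unit_regexp.match('#')           # string start and end on either side
hash_inside = unit_regexp.match('Suite # 12')  # whitespace on both sides
floor_word  = unit_regexp.match('3rd Floor')   # whitespace before, string end after
glued       = unit_regexp.match('#12')         # '1' follows the '#', so the lookahead fails

p hash_alone, hash_inside, floor_word, glued
```

The last case shows the kind of tweak you may need: if your data has things like '#12', the lookahead would have to allow a digit after the unit as well.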

Upvotes: 4
