Dave
Dave

Reputation: 19120

How to use Regexp.union to match a character at the beginning of my string

I'm using Ruby 2.4. I want to match an optional "a" or "b" character, followed by an arbitrary amount of white space, and then one or more numbers, but my regex's are failing to match any of these:

2.4.0 :017 > MY_TOKENS = ["a", "b"]
 => ["a", "b"]
2.4.0 :018 > str = "40"
 => "40"
2.4.0 :019 > str =~ Regexp.new("^[#{Regexp.union(MY_TOKENS)}]?[[:space:]]*\d+[^a-z^0-9]*$")
 => nil
2.4.0 :020 > str =~ Regexp.new("^#{Regexp.union(MY_TOKENS)}?[[:space:]]*\d+[^a-z^0-9]*$")
 => nil
2.4.0 :021 > str =~ Regexp.new("^#{Regexp.union(MY_TOKENS)}?[[:space:]]*\d+$")
 => nil

I'm stumped as to what I'm doing wrong.

Upvotes: 1

Views: 1075

Answers (2)

Eric Duminil
Eric Duminil

Reputation: 54223

If they are single characters, just use MY_TOKENS.join inside the character class:

MY_TOKENS = ["a", "b"]
str = "40"
first_regex = /^[#{MY_TOKENS.join}]?[[:space:]]*\d+[^a-z0-9]*$/
# /^[ab]?[[:space:]]*\d+[^a-z0-9]*$/ 
puts str =~ first_regex
# 0

You can also integrate the Regexp.union, it might lead to some unexpected bugs though, because the flags of the outer regexp won't apply to the inner one :

second_regex = /^#{Regexp.union(MY_TOKENS)}?[[:space:]]*\d+[^a-z0-9]*$/
# /^(?-mix:a|b)?[[:space:]]*\d+[^a-z0-9]*$/
puts str =~ second_regex
# 0

The above regex looks a lot like what you did, but using // instead of Regexp.new prevents you from having to escape the backslashes.

You could use Regexp#source to avoid this behaviour :

third_regex = /^(?:#{Regexp.union(MY_TOKENS).source})?[[:space:]]*\d+[^a-z0-9]*$/
# /^(?:a|b)?[[:space:]]*\d+[^a-z0-9]*$/
puts str =~ third_regex
# 0

or simply build your regex union :

fourth_regex = /^(?:#{MY_TOKENS.join('|')})?[[:space:]]*\d+[^a-z0-9]*$/
# /^(?:a|b)?[[:space:]]*\d+[^a-z0-9]*$/
puts str =~ fourth_regex
# 0

The 3 last examples should work fine if MY_TOKENS has words instead of just characters.

first_regex, third_regex and fourth_regex should all work fine with /i flag.

As an example :

first_regex = /^[#{MY_TOKENS.join}]?[[:space:]]*\d+[^a-z0-9]*$/i
"A 40" =~ first_regex
# 0

Upvotes: 3

Wiktor Stribiżew
Wiktor Stribiżew

Reputation: 626758

I believe you want to match a string that may contain any of the alternatives you defined in the MY_TOKENS, then 0+ whitespaces and then 1 or more digits up to the end of the string.

Then you need to use

Regexp.new("\\A#{Regexp.union(MY_TOKENS)}?[[:space:]]*\\d+\\z").match?(s)

or

/\A#{Regexp.union(MY_TOKENS)}?[[:space:]]*\d+\z/.match?(s)

When you use a Regexp.new, you should rememeber to double escape backslashes to define a literal backslash (e.g. "\d" is a digit matching pattern). In a regex literal notation, you may use a single backslash (/\d/).

Do not forget to match the start of a string with \A and end of string with \z anchors.

Note that [...] creates a character class that matches any char that is defined inside it: [ab] matches an a or b, [program] will match one char, either p, r, o, g, r, a or m. If you have multicharacter sequences in the MY_TOKENS, you need to remove [...] from the pattern.

To make the regex case insensitive, pass a case insensitive modifier to the pattern and make sure you use .source property of the Regex.union created regex to remove flags (thanks, Eric):

Regexp.new("(?i)\\A#{Regexp.union(MY_TOKENS).source}?[[:space:]]*\\d+\\z")

or

/\A#{Regexp.union(MY_TOKENS).source}?[[:space:]]*\d+\z/i

The regex created is /(?i-mx:\Aa|b?[[:space:]]*\d+\z)/ where (?i-mx) means the case insensitive mode is on and multiline (dot matches line breaks and verbose modes are off).

Upvotes: 1

Related Questions