Robin

Reputation: 21884

How to validate the format of a string in Ruby, while extracting the matches?

What I want

My issue

When using match, the capture group only keeps the last occurrence:

/^(#\d\s*)+$/.match "#1 #2"
# => #<MatchData "#1 #2" 1:"#2">

When I use scan, it "works":

"#1 #2".scan /#\d/
# => ["#1", "#2"]

But I don't believe I can validate the format of the string this way, as it would return the same result for "aaa #1 #2".

The question

Can I, with only one method call, both validate that my string matches /^(#\d\s*)+$/ and grab all the instances of #number?

I kinda feel bad about asking this since I've been using ruby for a while now. It seems simple but I can't get that to work.

Upvotes: 3

Views: 330

Answers (2)

Cary Swoveland

Reputation: 110675

def doit(str)
  r = /\A#{"(#\\d)\\s*"*str.count('#')}\z/      
  str.match(r)&.captures
end

doit "#1#2 #3 "    #=> ["#1", "#2", "#3"]
doit " #1#2 #3 "   #=> nil

Notice that the regular expression depends only on the number of '#' characters in the string. As that number is three in both examples, the two regular expressions are identical, namely:

/\A(#\d)\s*(#\d)\s*(#\d)\s*\z/

This regular expression was constructed as follows.

str = "#1#2 #3 "
n = str.count('#')
  #=> 3
s = "(#\\d)\\s*"*n
  #=> "(#\\d)\\s*(#\\d)\\s*(#\\d)\\s*" 
/\A#{s}\z/ 
  #=> /\A(#\d)\s*(#\d)\s*(#\d)\s*\z/ 

The regular expression reads, "match the beginning of the string, followed by three identical capture groups, each optionally followed by whitespace, followed by the end of the string". The regular expression therefore both tests the validity of the string and extracts the desired matches in the capture groups.

The safe navigation operator, &., is needed in case there is no match (match then returns nil).
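
As a small illustration of that point, the sketch below contrasts a successful and a failed match. The one-group pattern rx is a simplified stand-in chosen for this example, not the pattern built by doit.

```ruby
# Simplified stand-in pattern (an assumption for illustration only).
rx = /\A(#\d)\s*\z/

m = "#1".match(rx)
p m&.captures        # prints ["#1"]

m = "oops".match(rx)
p m&.captures        # prints nil: m is nil, so &. skips the call

begin
  m.captures         # a plain call on nil raises instead
rescue NoMethodError => e
  puts "without &.: #{e.class}"
end
```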

A comment by the OP refers to a generalisation of the question in which the pound character ('#') is optional. That can be dealt with by modifying the regular expression as follows.

def doit(str)
  r = /\A#{"(?:#?(\\d)(?=#|\\s+|\\z)\\s*)"*str.count('0123456789')}\z/
  str.match(r)&.captures
end

doit "1 2 #3 "     #=> ["1", "2", "3"] 
doit "1#2"         #=> ["1", "2"] 
doit " #1 2 #3 "   #=> nil   
doit "#1 2# 3 "    #=> nil 
doit " #1 23 #3 "  #=> nil 

For strings containing three digits the regular expression is:

/\A(?:#?(\d)(?=#|\s+|\z)\s*)(?:#?(\d)(?=#|\s+|\z)\s*)(?:#?(\d)(?=#|\s+|\z)\s*)\z/ 

While this regular expression can become quite long, that does not necessarily make it inefficient, as each lookahead inspects only the text immediately following a digit.
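
One way to check this on a given machine is a rough timing sketch such as the one below. It reuses the second doit from above; the input sizes and the generated test strings are arbitrary assumptions, and no particular timings are implied.

```ruby
require 'benchmark'

# Second version of doit from this answer ('#' optional).
def doit(str)
  r = /\A#{"(?:#?(\\d)(?=#|\\s+|\\z)\\s*)" * str.count('0123456789')}\z/
  str.match(r)&.captures
end

# Hypothetical input sizes, chosen only for illustration.
[10, 100, 500].each do |n|
  # Build a valid string of n tokens, e.g. "#0 #1 #2 ..."
  str = Array.new(n) { |i| "##{i % 10}" }.join(' ')
  result = nil
  seconds = Benchmark.realtime { result = doit(str) }
  puts format('%4d tokens: %.6f s, %d captures', n, seconds, result.size)
end
```

Note that the timing includes building the regular expression, which is part of the cost of each call to doit.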

Upvotes: 1

Wiktor Stribiżew
Wiktor Stribiżew

Reputation: 626747

Yes, you may use

s.scan(/(?:\G(?!\A)|\A(?=(?:#\d\s*)*\z))\s*\K#\d/)

See the regex demo

Details

  • (?:\G(?!\A)|\A(?=(?:#\d\s*)*\z)) - two alternatives:
    • \G(?!\A) - the end of the previous successful match
    • | - or
    • \A(?=(?:#\d\s*)*\z) - start of string (\A) that is followed by 0 or more repetitions of # + digit + 0+ whitespace chars, and then by the end of string
  • \s* - 0+ whitespace chars
  • \K - match reset operator discarding the text matched so far
  • #\d - a # char and then a digit
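
The breakdown above can also be written into the pattern itself using Ruby's extended mode (the x flag). This is only a sketch of the same regex: the constant name RX_X is made up for this example, and because # starts a comment in extended mode, the literal # has to be escaped as \#.

```ruby
# Same pattern as above, spelled out with the x flag (RX_X is a hypothetical name).
RX_X = /
  (?:
    \G(?!\A)                # continue at the end of the previous match, never at the start
  |
    \A(?=(?:\#\d\s*)*\z)    # or start of string, only if the whole string is valid
  )
  \s*                       # skip whitespace before the token
  \K                        # discard the text matched so far
  \#\d                      # the token: a literal hash sign and one digit
/x

p "#1 #2".scan(RX_X)     # prints ["#1", "#2"]
p "#1 NO #2".scan(RX_X)  # prints []
```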

In short: the start-of-string position is matched first, but only if the text to its right (i.e. the whole string) matches the pattern you want. Since that check is performed with a lookahead, the regex index stays where it was. Every subsequent match can then begin only where the previous one ended, thanks to the \G operator (it matches either the start of the string or the end of the previous match, so (?!\A) is used to exclude the start-of-string position).

Ruby demo:

rx = /(?:\G(?!\A)|\A(?=(?:#\d\s*)*\z))\s*\K#\d/
p "#1 #2".scan(rx)
# => ["#1", "#2"]
p "#1 NO #2".scan(rx)
# => []
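
One thing to keep in mind: scan returns an empty array both for an invalid string and for an empty string, since the lookahead allows zero repetitions. If the caller needs an explicit "invalid" signal, a thin wrapper like the sketch below could be used. The name tokens_or_nil is hypothetical, and treating "no tokens" as invalid is an assumption based on the + in the question's original pattern.

```ruby
RX = /(?:\G(?!\A)|\A(?=(?:#\d\s*)*\z))\s*\K#\d/

# Hypothetical helper: the tokens for a valid string,
# nil when the string is invalid or contains no tokens at all.
def tokens_or_nil(str)
  found = str.scan(RX)
  found.empty? ? nil : found
end

p tokens_or_nil("#1 #2")      # prints ["#1", "#2"]
p tokens_or_nil("aaa #1 #2")  # prints nil
p tokens_or_nil("")           # prints nil
```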

Upvotes: 3
