code_t

Reputation: 49

multiline regular expression not working in ruby

I have written a regular expression in Ruby that works fine on a single line, but it is quite long, so I need to write it in multi-line form.

I am using the %r{}x format to spread it over multiple lines, but it is not working.

regex = (/\A(RM|R1)([A-Z])([A-Z])(\d+)(\d\d+)([A-Z])([A-Z])([A-Z]+)-?(\d+)([A-Z])(\d)#?([A-Z])([A-Z])(\d)\z/)

in single line

regex = %r{
        ([A-Z])
        ([A-Z])
        ([A-Z])
        (\d+)
        (\d\d+)
        ([A-Z])
        ([A-Z])
        ([A-Z]+)
        -?
        (\d+)
        ([A-Z])
        (\d)
        #?
        ([A-Z])
        ([A-Z])
        (\d)
     }x

in multiple lines (one group in each line)

What is going wrong with my approach?

Upvotes: 1

Views: 301

Answers (2)

Cary Swoveland

Reputation: 110675

Here is your regular expression defined in free-spacing mode, which is what I think you are looking for.

regex = /
        \A        # beginning of string
        (RM|R1)   # match 'RM' or 'R1'           CG  1 
        ([A-Z])   # match 1 uppercase letter     CG  2
        ([A-Z])   # match 1 uppercase letter     CG  3
        (\d+)     # match > 0 digits             CG  4
        (\d{2,})  # match > 1 digits             CG  5
        ([A-Z])   # match 1 uppercase letter     CG  6
        ([A-Z])   # match 1 uppercase letter     CG  7
        ([A-Z]+)  # match > 0 uppercase letters  CG  8
        -?        # optionally match '-'
        (\d+)     # match > 0 digits             CG  9
        ([A-Z])   # match 1 uppercase letter     CG 10
        (\d)      # match 1 digit                CG 11
        \#?       # optionally match '#'
        ([A-Z])   # match 1 uppercase letter     CG 12
        ([A-Z])   # match 1 uppercase letter     CG 13
        (\d)      # match 1 digit                CG 14
        \z        # end of string
        /x        # free-spacing regex definition mode

"CG" is for "capture group". One of the main uses of free-spacing mode is to document the regex, as I've done here.

I've made two changes to your regex. Firstly, I've replaced (\d\d+) with (\d{2,}), which has the same effect but arguably reads better. Secondly, the character "#" begins a comment in free-spacing mode, so it must be escaped (\#) if it is to be matched.

As an example of the use of this regex,

test_str = "RMAB12345CDEF-6G7#HI8"
m = test_str.match regex
  #=> #<MatchData "RMAB12345CDEF-6G7#HI8" 1:"RM" 2:"A" 3:"B" 4:"123" 5:"45"
  #     6:"C" 7:"D" 8:"EF" 9:"6" 10:"G" 11:"7" 12:"H" 13:"I" 14:"8"> 
m.captures
  #=> ["RM", "A", "B", "123", "45", "C", "D", "EF", "6", "G", "7", "H", "I", "8"] 

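Notice that the regex itself does not say how the 5 digits are to be divided between capture groups 4 and 5. Because \d+ is greedy, group 4 takes as many digits as it can ("123") and leaves only the required minimum of two for group 5 ("45").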
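To make the split between the two digit groups concrete, here is a small sketch using only those two groups. The lazy and fixed-width variants are my own illustrations of how the split could be changed, not something the question asks for.

```ruby
digits = "12345"

# Greedy \d+ grabs as much as it can, leaving the minimum for \d{2,}.
greedy = digits.match(/\A(\d+)(\d{2,})\z/)

# Lazy \d+? grabs as little as it can, so \d{2,} gets the rest.
lazy = digits.match(/\A(\d+?)(\d{2,})\z/)

# A fixed width removes the ambiguity (assumes group 4 is always 2 digits).
fixed = digits.match(/\A(\d{2})(\d{2,})\z/)

p greedy.captures  #=> ["123", "45"]
p lazy.captures    #=> ["1", "2345"]
p fixed.captures   #=> ["12", "345"]
```

If you know how many digits belong to each group, stating it with {n} is probably the safest choice.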

There is one thing you must be careful about when using free-spacing mode. Whitespace in the pattern (spaces, tabs and newlines) is removed before the expression is parsed, including any spaces you want matched, unless you protect it. For example,

"ab c".match? /ab c/   #=> true
"ab c".match? /ab c/x  #=> false
"abc".match?  /ab c/x  #=> true

Here are some ways to protect the space character (all return true):

"ab c".match? /ab\ c/x           # escape a space character
"ab c".match? /ab[ ]c/x          # put in a character class
"ab c".match? /ab[[:space:]]c/x  # POSIX bracket expression
"ab c".match? /ab\p{Space}c/x    # Unicode \p{} construct
"ab c".match? /ab\sc/x           # match a whitespace character

Note that \s matches tabs, newlines and a few other whitespace characters (such as carriage returns and form feeds) as well as spaces, which may or may not be desired.
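A quick sketch of that difference, comparing \s with a character class that contains only a space. The variable names are just for illustration.

```ruby
with_s     = /\Aab\sc\z/x   # any whitespace character between "ab" and "c"
with_class = /\Aab[ ]c\z/x  # only a literal space between "ab" and "c"

p with_s.match?("ab c")       #=> true
p with_class.match?("ab c")   #=> true

p with_s.match?("ab\tc")      #=> true
p with_class.match?("ab\tc")  #=> false

p with_s.match?("ab\nc")      #=> true
p with_class.match?("ab\nc")  #=> false
```

So if only a plain space should match, the escaped space or the [ ] form is the stricter option.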

Upvotes: 2

Wiktor Stribiżew

Reputation: 626738

You should escape the # symbol because, in free-spacing mode, it starts a comment:

Literal white space inside the pattern is ignored, and the octothorpe (#) character introduces a comment until the end of the line. This allows the components of the pattern to be organized in a potentially more readable fashion.

So, replace #? with \#?.
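Here is a minimal sketch of the effect, using a cut-down pattern (a, an optional #, then b) rather than the full regex from the question:

```ruby
# Unescaped: everything from '#' to the end of that line is a comment,
# so this pattern behaves like /\Aab\z/.
unescaped = /
  \A
  a
  #?
  b
  \z
/x

# Escaped: '\#?' optionally matches a literal '#'.
escaped = /
  \A
  a
  \#?
  b
  \z
/x

p unescaped.match?("ab")   #=> true
p unescaped.match?("a#b")  #=> false
p escaped.match?("ab")     #=> true
p escaped.match?("a#b")    #=> true
```

The unescaped version still compiles without error, which is why the problem only shows up when a string containing # fails to match.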

Upvotes: 2
