Reputation: 1491

Regex - number of characters for sequence

I have the following pattern:

<tag-2>B1</tag-2>
<tag-3>A12</tag-3>
<tag-4>M123</tag-4>
//etc

There is always one letter followed by digits. I need to create a regex which takes the number from the tag and uses it as the required length of the sequence between the tags. I know that I can use a backreference, but I don't know how to construct the regex. Here is my incomplete regex:

"^<tag-([2-9])>[A-Z][0-9]/*how to apply here number from the tag ?*/</tag-\\1>$"

Edit

The following strings should not be matched:

<tag-2>11</tag-2> //missing letter
<tag-2>BB</tag-2> // missing digit
<tag-3>B123</tag-3> //too many digits
<tag-3>AA1</tag-3> //should be only one letter and two digits
<tag-4>N12</tag-4> //too few digits

Upvotes: 0

Answers (2)

Cary Swoveland

Reputation: 110685

Regular expressions cannot contain elements that are functions of the values of back-references (other than the back-references themselves). That's because regular expressions are static from the time they are constructed.

One could, however, extract the desired string, or conclude that the string contains no valid substring, in two steps. First, attempt to match the string against /<tag-(\d+)>/. The contents of the capture group, converted to an integer, give the required length of the substring that begins with a capital letter and is followed by digits. That information can then be used to construct a second regular expression that verifies the remainder of the string and extracts the desired substring.

I will use Ruby to illustrate how that might be done here. The operations--and certainly the two regular expressions--should be clear even to readers who are not familiar with Ruby.

Code

R = /<tag-(\d+)>/           # a constant

def doit(str)
  m = str.match(R)          # obtain a MatchData object; else nil
  return nil if m.nil?      # finished if no match
  n = m[1].to_i-1           # required number of digits
  r = /\A\p{Lu}\d{#{n}}(?=<\/tag-#{m[1]}>)/
                            # regular expression for second match
  str[m.end(0).to_i..-1][r] # extract the desired string; else nil
end

Examples

arr = <<_.each_line.map(&:chomp)
<tag-2>B1</tag-2>
<tag-3>A12</tag-3>
<tag-4>M123</tag-4>
<tag-2>11</tag-2>
<tag-2>BB</tag-2>
<tag-3>B123</tag-3>
<tag-3>AA1</tag-3>
<tag-4>N12</tag-4>
_
  #=> ["<tag-2>B1</tag-2>",   "<tag-3>A12</tag-3>",
  #    "<tag-4>M123</tag-4>", "<tag-2>11</tag-2>",
  #    "<tag-2>BB</tag-2>",   "<tag-3>B123</tag-3>",
  #    "<tag-3>AA1</tag-3>",  "<tag-4>N12</tag-4>"]

arr.each do |line|
  s = doit(line)
  s = 'nil' if s.nil?
  puts "#{line.ljust(22)}: #{s}"
end
<tag-2>B1</tag-2>     : B1
<tag-3>A12</tag-3>    : A12
<tag-4>M123</tag-4>   : M123
<tag-2>11</tag-2>     : nil
<tag-2>BB</tag-2>     : nil
<tag-3>B123</tag-3>   : nil
<tag-3>AA1</tag-3>    : nil
<tag-4>N12</tag-4>    : nil

Explanation

Note that (?=<\/tag-#{m[1]}>) (part of r in the body of the method) is a positive lookahead. It requires that "<\/tag-#{m[1]}>" (with #{m[1]} replaced by its value) be matched at that position, but the matched text is not part of the match that is returned.

The step-by-step calculations are as follows.

str = "<tag-2>B1</tag-2>"

m = str.match(R)
  #=> #<MatchData "<tag-2>" 1:"2"> 
m[0]
  #=> "<tag-2>"  (match)
m[1]
  #=> "2"  (contents of capture group 1)
m.end(0)
  #=> 7  (index in str of the first character after the match)
m.nil?
  #=> false  (do not return)
n = m[1].to_i-1
  #=> 1  (number of digits required)
r = /\A\p{Lu}\d{#{n}}(?=<\/tag-#{m[1]}>)/
  #=> /\A\p{Lu}\d{1}(?=<\/tag-2>)/
s = str[m.end(0).to_i..-1]
  #=> str[7..-1]
  #=> "B1</tag-2>" 
s[r]
  #=> "B1"
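
As a variation on the two steps above, here is a minimal sketch that uses one generic regular expression and then checks the length in ordinary Ruby. The method name extract_checked is hypothetical and not part of the code above. The sketch also assumes the whole string must be a single tag, as in the question's examples, so it anchors with \A and \z.

```ruby
# Hypothetical helper: match any well-formed tag with one regex, then
# compare the length of the contents with the number in the tag.
def extract_checked(str)
  # \1 requires the closing tag to carry the same number as the opening tag.
  m = /\A<tag-(\d+)>(\p{Lu}\d+)<\/tag-\1>\z/.match(str)
  return nil if m.nil?                  # not of the form letter + digits
  # The regex cannot count for us, so the length test is done in Ruby.
  m[2].length == m[1].to_i ? m[2] : nil
end
```

The regular expression only verifies the shape of the string (one capital letter, then digits, between matching tags). The comparison of m[2].length with m[1].to_i does the part that a static regular expression cannot express.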

Upvotes: 1

Mike Hill

Reputation: 3772

It looks like you're trying to create a pattern that will interpret a number in order to determine how long a string should be. I don't know of any feature to automate this process in any regular expression engine, but it can be done in a more manual fashion by enumerating all cases which you wish to handle.

For example, tags 2 through 9 can be handled as follows:

<tag-2>: ^<tag-2>[A-Z][0-9]</tag-2>$
<tag-3>: ^<tag-3>[A-Z][0-9]{2}</tag-3>$
<tag-4>: ^<tag-4>[A-Z][0-9]{3}</tag-4>$
<tag-5>: ^<tag-5>[A-Z][0-9]{4}</tag-5>$
<tag-6>: ^<tag-6>[A-Z][0-9]{5}</tag-6>$
<tag-7>: ^<tag-7>[A-Z][0-9]{6}</tag-7>$
<tag-8>: ^<tag-8>[A-Z][0-9]{7}</tag-8>$
<tag-9>: ^<tag-9>[A-Z][0-9]{8}</tag-9>$

By removing the capture group and the back-reference you avoid some complications that can occur when combining regular expression patterns, and the individual cases can be joined into a single alternation:

^(<tag-2>[A-Z][0-9]</tag-2>|<tag-3>[A-Z][0-9]{2}</tag-3>|<tag-4>[A-Z][0-9]{3}</tag-4>|<tag-5>[A-Z][0-9]{4}</tag-5>|<tag-6>[A-Z][0-9]{5}</tag-6>|<tag-7>[A-Z][0-9]{6}</tag-7>|<tag-8>[A-Z][0-9]{7}</tag-8>|<tag-9>[A-Z][0-9]{8}</tag-9>)$
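
The combined pattern does not have to be typed out by hand. Here is a rough sketch in Ruby (my choice of language; the method name enumerated_regex is made up for illustration) that generates one alternative per tag number. It uses \A and \z rather than ^ and $, because in Ruby ^ and $ match at line boundaries rather than at the ends of the string.

```ruby
# Hypothetical helper: build the alternation for a range of tag numbers.
def enumerated_regex(range = 2..9)
  alternatives = range.map do |k|
    # Tag k holds one letter followed by k - 1 digits.
    "<tag-#{k}>[A-Z][0-9]{#{k - 1}}</tag-#{k}>"
  end
  # Non-capturing group around the alternatives, anchored to the whole string.
  Regexp.new("\\A(?:#{alternatives.join('|')})\\z")
end

re = enumerated_regex
puts re.match?("<tag-3>A12</tag-3>")   # valid: one letter, two digits
puts re.match?("<tag-3>B123</tag-3>")  # invalid: too many digits
```

Passing a different range, such as 2..20, extends the pattern without further typing, although the regular expression grows with every case that is added.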

Upvotes: 0
