Mostafa Talebi

Reputation: 9183

regular expression capture groups

I'm learning regular expressions (currently in JavaScript).

My question is this:

I have a single string of some length.

This string contains at least three (required) patterns.

As a result, I want to call rule.exec() on the string and get a three-element array, with each pattern in a separate element.

How should I approach this? I have managed to get it working, but only after a lot of trial and error, and I don't know EXACTLY what has to be done to capture a group. Is it the parentheses () that separate each group of a regular expression?

My regular expression example:

var rule = /([a-zA-Z0-9].*\s?(@classs?)+\s+[a-zA-Z0-9][^><]*)/g;
var str = "<Home @class www.tarjom.ir><string2 stringValue2>";
var res;
var keys = [];
var values = [];
// collect the full match (res[0]) of every hit
while ((res = rule.exec(str)) !== null)
{
    values.push(res[0]);
}
console.log(values);

// begin to slice them
var sliced = [];
for (var i = 0; i < values.length; i++)
{
    // convert each item into an array and push it onto the outer array
    sliced.push(values[i].split(" "));
}

// Last updated on 7th of Esfand
console.log(sliced);

And the returned result (Firefox 27, Firebug console.log):

 [["Home", "@class", "www.tarjom.ir"]]

I have got what I needed; I just need clarification on what is returned.

Upvotes: 0

Views: 881

Answers (1)

SQB

Reputation: 4078

Yes, parentheses capture whatever the pattern between them matches. Captured groups are numbered by the position of their opening parenthesis, counting from left to right. So if /(foo)((bar)baz)/ matches, your first captured group will contain foo, your second barbaz, and your third bar. In some dialects, only the first 9 capturing groups are numbered.
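
As a quick check of that numbering, here is a minimal sketch using exec() on the example pattern. The variable name m and the input string are just illustrative choices.

```javascript
// Groups are numbered by the position of their opening parenthesis.
const m = /(foo)((bar)baz)/.exec("foobarbaz");

console.log(m[0]); // whole match: "foobarbaz"
console.log(m[1]); // group 1:     "foo"
console.log(m[2]); // group 2:     "barbaz"
console.log(m[3]); // group 3:     "bar"
```

Index 0 always holds the entire match; the numbered groups start at index 1.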

Captured groups can be used for backreferencing. If you want to match "foobarfoo", /(foo)bar\1/ will do that, where \1 means "the first group I captured".
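
A small sketch of that backreference in JavaScript (the name backref and the second test string are made up for illustration):

```javascript
// \1 must match the same text that group 1 captured.
const backref = /(foo)bar\1/;

console.log(backref.test("foobarfoo")); // true
console.log(backref.test("foobarbaz")); // false: "baz" is not what group 1 captured
```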

There are ways to avoid capturing, if you just need the parentheses for grouping. For instance, if you want to match either "foo" or "foobar", /(foo(bar)?)/ will do so, but may have captured "bar" in its second group. If you want to avoid this, use /(foo(?:bar)?)/ to only have one capture, either "foo" or "foobar".
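
To see the difference, this sketch runs both variants against the same input and compares the resulting arrays. The variable names are illustrative only.

```javascript
// (bar)? captures; (?:bar)? only groups.
const capturing = /(foo(bar)?)/.exec("foobar");
const nonCapturing = /(foo(?:bar)?)/.exec("foobar");

console.log(capturing.length);    // 3: whole match + 2 groups
console.log(capturing[2]);        // "bar"
console.log(nonCapturing.length); // 2: whole match + 1 group
console.log(nonCapturing[1]);     // "foobar"
```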


Your code shows three values for a different reason. First, you do a match. Then you take the whole match, res[0] (which here is identical to your first capture, because group 1 wraps the entire pattern), and split it on a space. That is what you put in your array of results. Note that you push the entire array in there at once, so you end up with an array of arrays. Hence the double brackets.
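
If the goal is to get the three pieces straight from the capture groups instead of splitting, one possible sketch is below. The pattern tagRule is a hypothetical rewrite, not the pattern from the question, and it assumes each tag has the shape <name @class value> with no whitespace inside the value.

```javascript
// Hypothetical pattern: one capturing group per piece of the tag.
const tagRule = /<(\w+)\s+(@class)\s+([^<>\s]+)>/g;
const input = "<Home @class www.tarjom.ir><string2 stringValue2>";

const rows = [];
let match;
while ((match = tagRule.exec(input)) !== null) {
  // match[0] is the whole match; slice(1) keeps only groups 1..3
  rows.push(match.slice(1));
}

console.log(rows); // [["Home", "@class", "www.tarjom.ir"]]
```

The second tag is skipped because it contains no @class, so only one three-element array is collected.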

Your regex matches (pretending we're in Perl's eXtended legibility mode):

/                   # matching starts
  (                 # open 1st capturing group
    [a-zA-Z0-9]     # match 1 character that's in a-z, A-Z, or 0-9
    .*              # match as many characters as possible (greedy; any character except a line break)
    \s?             # optionally match one white space character (in practice this matches nothing here, since the greedy .* before it has already consumed the white space)
    (               # open 2nd capturing group
      @classs?      # match '@class' or '@classs'
    )+              # close 2nd group, matching it once or more
    \s+             # match one or more white space characters
    [a-zA-Z0-9]     # match 1 character that's in a-z, A-Z, or 0-9
    [^><]*          # match any number of characters that are not angle brackets
  )                 # close 1st capturing group
/g                  # modifiers - g: match globally (repeatedly match throughout entire input)
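
To tie the breakdown back to the question, this sketch runs the original pattern once and prints what each slot of the exec() result holds (the names rule and str are taken from the question; firstHit is an illustrative name):

```javascript
const rule = /([a-zA-Z0-9].*\s?(@classs?)+\s+[a-zA-Z0-9][^><]*)/g;
const str = "<Home @class www.tarjom.ir><string2 stringValue2>";

const firstHit = rule.exec(str);

console.log(firstHit[0]); // whole match: "Home @class www.tarjom.ir"
console.log(firstHit[1]); // group 1 (wraps everything): "Home @class www.tarjom.ir"
console.log(firstHit[2]); // group 2: "@class"
console.log(rule.exec(str)); // null: the second tag has no @class
```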

Upvotes: 3
