Jimmy

Reputation: 5241

Regex: captures, groups, confusion

I can't seem to figure out captures and groups in Regex (.NET).

Let's say I have the following input string, where each letter is actually a placeholder for a more complex regex expression (so simple character exclusion won't work):

CBDAEDBCEFBCD

Or, more generically, here is a string pattern written in 'regex':

(C|B|D)*A(E*)(D|B|C)*(E*)F(B|C|D)*

There will only be one A and one F. I need to capture as individual 'captures' (or matches or groups) all instances of B, C, D (which in my app are more complex groups) that occur after A and before F. I also need A and F. I don't need E. And I don't need the C,B,D before the A or the B,C,D after the F.

I would expect the correct result to be:

Groups["start"] (1 capture) = A
Groups["content"] (3 captures)  
  Captures[0] = D  
  Captures[1] = B
  Captures[2] = C
Groups["end"] (1 capture) = F

I tried a few feeble attempts but none of them worked.

This one "incorrectly" captures only the last C before EF in the sample string above (it does correctly capture start = A and end = F):

(?<=(?<start>A)).+(?<content>B|C|D).+(?=(?<end>F))

Same results as above (just added a + after (?<content>B|C|D)):

(?<=(?<start>A)).+(?<content>B|C|D)+.+(?=(?<end>F))

Got rid of look-around stuff... same result as above

(?<start>A).+(?<content>B|C|D)+.+(?<end>F)

And then my good-for-nothing brain went on strike.

So, what's the right way to approach this? Are look-arounds really needed for this or not?

Thanks!

Upvotes: 0

Views: 218

Answers (2)

Alan Moore

Reputation: 75232

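Yeah, forget the lookarounds; they just complicate things needlessly. But I suspect your final regex will work if you make that first .+ reluctant (lazy), so that it stops before the first B, C, or D instead of swallowing everything up to the last one: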

(?<start>A).+?(?<content>B|C|D)+.+(?<end>F)

EDIT: yep:

string s = "CBDAEDBCEFBCD";
Regex r = new Regex(@"(?<start>A).+?(?<content>B|C|D)+.+(?<end>F)");

foreach (Match m in r.Matches(s))
{
  Console.WriteLine(@"Groups[""start""] = {0}", m.Groups["start"]);
  foreach (Capture c in m.Groups["content"].Captures)
  {
    Console.WriteLine(@"Capture[""content""] = {0}", c.Value);
  }
  Console.WriteLine(@"Groups[""end""] = {0}", m.Groups["end"]);
}

output:

Groups["start"] = A
Capture["content"] = D
Capture["content"] = B
Capture["content"] = C
Groups["end"] = F
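
To make the difference between the greedy and the reluctant quantifier concrete, here is a minimal sketch that runs both patterns from this thread against the sample string and prints the captures of the content group. The class name GreedyVsReluctant and the helper ContentCaptures are made up for illustration; only the two patterns and the input come from the posts above.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

class GreedyVsReluctant
{
    // Hypothetical helper: joins every capture recorded for the "content" group, in order.
    static string ContentCaptures(string pattern, string input)
    {
        Match m = Regex.Match(input, pattern);
        return string.Join(",", m.Groups["content"].Captures.Cast<Capture>().Select(c => c.Value));
    }

    static void Main()
    {
        string s = "CBDAEDBCEFBCD";

        // Greedy first .+ : it grabs as much as it can, leaving only the last C for "content".
        Console.WriteLine(ContentCaptures(@"(?<start>A).+(?<content>B|C|D)+.+(?<end>F)", s));  // C

        // Reluctant first .+? : it stops after one character, so the + on "content" sees D, B and C.
        Console.WriteLine(ContentCaptures(@"(?<start>A).+?(?<content>B|C|D)+.+(?<end>F)", s)); // D,B,C
    }
}
```

In both cases the group is repeated by the same +; what changes is how much of the input the preceding .+ leaves for it.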

Upvotes: 2

Snekse

Reputation: 15799

Since you said all instances of C, B, D, I would think you'd want to use a character class for that: [CBD]*. Also, if you're just looking for something that comes after the letter A but before F, then you should be able to use those literals along with some exclusions.

Here's a pattern I came up with. Group $4 should contain the letters DBC:

([^A]*)(A)([^CBDF]*)([CBD]*)([^F]*)(F)(.*)

Here's an example of this pattern in action.
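
As a rough C# sketch of how that pattern could be checked in .NET (the class name NumberedGroups is made up for illustration; the pattern and input are the ones from this thread):

```csharp
using System;
using System.Text.RegularExpressions;

class NumberedGroups
{
    static void Main()
    {
        var r = new Regex(@"([^A]*)(A)([^CBDF]*)([CBD]*)([^F]*)(F)(.*)");
        Match m = r.Match("CBDAEDBCEFBCD");

        Console.WriteLine(m.Groups[2].Value);          // A
        Console.WriteLine(m.Groups[4].Value);          // DBC
        Console.WriteLine(m.Groups[6].Value);          // F

        // The quantifier sits inside group 4, so the group matches once
        // and records a single capture holding the whole run.
        Console.WriteLine(m.Groups[4].Captures.Count); // 1
    }
}
```

Note that group 4 comes back as one string, "DBC", rather than three separate captures, so the items would still have to be split afterwards. The character classes also assume each item is a single character, whereas the question says the letters stand in for more complex expressions.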

The question is, what do you want if the original string is CBDAEDEBECEFBCD?
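
For that interleaved string, the pattern (?<start>A).+?(?<content>B|C|D)+.+(?<end>F) from the other answer would record only D, because the repeated group stops at the first E. If the intent is to collect every B, C, or D between A and F and simply skip the E's (an assumption; the question does not say), one possible sketch is to put the optional E's inside the repetition. The class name SkipSeparators and the pattern below are illustrative, not something proposed in the original posts.

```csharp
using System;
using System.Text.RegularExpressions;

class SkipSeparators
{
    static void Main()
    {
        // Each repetition of the non-capturing group skips any E's, then records one item.
        // Assumption: E is the only thing allowed between items, as in the question's generic pattern.
        var r = new Regex(@"(?<start>A)(?:E*(?<content>B|C|D))*E*(?<end>F)");

        foreach (string s in new[] { "CBDAEDBCEFBCD", "CBDAEDEBECEFBCD" })
        {
            Match m = r.Match(s);
            Console.Write("{0}: {1}", s, m.Groups["start"].Value);
            foreach (Capture c in m.Groups["content"].Captures)
                Console.Write(" " + c.Value);
            Console.WriteLine(" " + m.Groups["end"].Value);
        }
        // Prints:
        // CBDAEDBCEFBCD: A D B C F
        // CBDAEDEBECEFBCD: A D B C F
    }
}
```

Because no .+ is involved, there is nothing to backtrack over, and the B|C|D and E parts can be swapped for the more complex sub-expressions they stand for.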

Upvotes: 0
