David Mårtensson

Reputation: 7620

Fetch group name with LINQ from regex matches

I am trying to build a very simplified lexer using regex and named groups in C#.

I can get all the matched tokens along with their positions just fine, but I cannot find a way to also get the name of the group that matched each token.

I was planning to use that name as the token type.

Here is a small example designed to lex simple SQL.

var matches = Regex.Matches("Select * from items where id > '10'", @"
(?:
(?<string>'[^']*')|
(?<number>\d+)|
(?<identifier>[a-zA-Z][a-zA-Z_0-9]+)|
(?:\s+)|
(?<operator><=|>=|<>|!=|\+|=|\(|\)|<|>|\*)|
(?<other>.*)
)+
", RegexOptions.IgnorePatternWhitespace)
.Cast<Match>()
.SelectMany (m => m
    .Groups
    .Cast<Group>()
    .SelectMany (g => g
        .Captures
        .Cast<Capture>()
        .Select (c => new {c.Index, c.Length, c.Value})))
.Skip(1)
.Where (m => m.Length > 0)
.OrderBy (m => m.Index);

This returns a small result like this:

0 6 Select
7 1 *
9 4 from
14 5 items
20 5 where
26 2 id
29 1 >
31 4 '10'

But how can I get the capture group names into the table? Is it possible?

This is not a homework exercise or any type of school work; it's an experiment I am doing for a simple automation API for one of our products.

I can probably rewrite it using a more verbose solution, but I kind of like the "one-liner approach" of this one ;)

And if all else fails, I already have a full lexer using real classes and much more advanced pattern matching, but that is not really required for this :D

UPDATE! I know which groups are available. What I would like to get is, for each capture in the result, which group caught it.

As the first comment points out, there is a method to get all group names from a regex, but then you have to fetch the results group by group. There does not seem to be a way to get the group from the capture.

Upvotes: 3

Views: 4378

Answers (1)

David Mårtensson

Reputation: 7620

[Appended a new solution I found following the link to the possible duplicate]

The answer to my question seems to be that it is not possible to get group names in any way except from the regex object.

I used part of the solution from the first comment's reference to work around this, but I would have liked a more direct route.

Here is the solution I ended up with (it uses LINQPad's Dump()).

var source = "select * from people where id > 10";

var re = new Regex(@"
    (?:
    (?<reserved>select|from|where|and|or|null|is|not)|
    (?<string>'[^']*')|
    (?<number>\d+)|
    (?<identifier>[a-z][a-z_0-9]+|\[[^\]]+\])|
    (?:\s+)|
    (?<operator><=|>=|<>|!=|\+|=|\(|\)|<|>|\*|,|\.)|
    (?<other>.*)
    )+
    ", RegexOptions.IgnorePatternWhitespace | RegexOptions.IgnoreCase | RegexOptions.Compiled);
    
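// Run the match once and reuse it, instead of re-matching for every group name.
var match = re.Match(source);

(
    from name 
    in re.GetGroupNames() 
    select new {name = name, captures = match.Groups[name].Captures}
)
.Where (r => r.name != "0") // "0" is the implicit group for the whole match
.SelectMany (r => (
    from Capture c 
    in r.captures 
    where c.Length > 0 // drop empty captures
    select new {Type = r.name, Index = c.Index, Length = c.Length, Value = c.Value}
    )
).OrderBy (r => r.Index).ToList().Dump();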
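
For comparison, here is a rough sketch of a different technique, not the one used above: drop the outer `(?: ... )+` so that every token becomes its own Match, then ask each Match which named group succeeded. The class name `TokenDemo`, the method `Lex` and the trimmed-down pattern are made up for illustration, and the tuple syntax assumes C# 7 or later.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class TokenDemo
{
    // Hypothetical, simplified pattern: one token per Match (no outer repetition),
    // so exactly one named group succeeds for each Match.
    private static readonly Regex TokenRegex = new Regex(@"
        (?<string>'[^']*')|
        (?<number>\d+)|
        (?<identifier>[a-z][a-z_0-9]+)|
        (?<operator><=|>=|<>|!=|\+|=|\(|\)|<|>|\*|,)|
        (?<other>\S)
        ", RegexOptions.IgnorePatternWhitespace | RegexOptions.IgnoreCase);

    public static (string Type, int Index, int Length, string Value)[] Lex(string source)
    {
        // GetGroupNames() also returns "0" (the whole match); leave it out.
        var names = TokenRegex.GetGroupNames().Where(n => n != "0").ToArray();

        // Whitespace is not part of any alternative, so Matches() simply skips it.
        return TokenRegex.Matches(source)
            .Cast<Match>()
            .Select(m => (
                Type: names.First(n => m.Groups[n].Success),
                Index: m.Index,
                Length: m.Length,
                Value: m.Value))
            .ToArray();
    }

    public static void Main()
    {
        foreach (var t in Lex("select * from people where id > 10"))
            Console.WriteLine($"{t.Type,-10} {t.Index,3} {t.Length,2} {t.Value}");
    }
}
```

The trade-off is that this does a small name lookup per token, but it never needs the capture-to-group mapping at all, and it works on framework versions where Group has no Name property.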

Based on a comment on the possible duplicate: from .NET 4.7 onward, Group has a Name property, which was not present when I made this test. So in case anyone stumbles upon this and is not discouraged enough, here is a version that does what I originally tried but no longer need for anything :)

var matches = Regex.Matches("Select * from items where id > '10'", @"
(?:
(?<string>'[^']*')|
(?<number>\d+)|
(?<identifier>[a-zA-Z][a-zA-Z_0-9]+)|
(?:\s+)|
(?<operator><=|>=|<>|!=|\+|=|\(|\)|<|>|\*)|
(?<other>.*)
)+
", RegexOptions.IgnorePatternWhitespace)
.Cast<Match>()
.SelectMany(m => m
   .Groups
   .Cast<Group>()
   .SelectMany(g => g
      .Captures
      .Cast<Capture>()
      .Select(c => new { c.Index, c.Length, c.Value, g.Name })))
.Skip(1)
.Where(m => m.Length > 0)
.OrderBy(m => m.Index).Dump();
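
One thing to watch in the query above: `.Skip(1)` only drops the very first capture in the flattened sequence, which is the whole-match group of the first Match. With a pattern that can produce several non-empty matches, the whole-match captures of the later ones would slip through. A minimal sketch that filters on the group name instead, assuming `Group.Name` is available (.NET Framework 4.7+ or .NET Core); the class `GroupNameDemo` and the method `Describe` are made-up names, and Console output stands in for LINQPad's Dump():

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class GroupNameDemo
{
    // Returns one "index length value name" line per non-empty capture.
    public static string[] Describe(string source)
    {
        return Regex.Matches(source, @"
            (?:
            (?<string>'[^']*')|
            (?<number>\d+)|
            (?<identifier>[a-zA-Z][a-zA-Z_0-9]+)|
            (?:\s+)|
            (?<operator><=|>=|<>|!=|\+|=|\(|\)|<|>|\*)|
            (?<other>.*)
            )+
            ", RegexOptions.IgnorePatternWhitespace)
            .Cast<Match>()
            .SelectMany(m => m.Groups.Cast<Group>()
                // Group "0" is the whole match; exclude it by name for every Match.
                .Where(g => g.Name != "0")
                .SelectMany(g => g.Captures.Cast<Capture>()
                    .Select(c => new { c.Index, c.Length, c.Value, g.Name })))
            .Where(t => t.Length > 0)
            .OrderBy(t => t.Index)
            .Select(t => $"{t.Index} {t.Length} {t.Value} {t.Name}")
            .ToArray();
    }

    public static void Main()
    {
        foreach (var line in Describe("Select * from items where id > '10'"))
            Console.WriteLine(line);
    }
}
```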

Upvotes: 2
