Rajdip Patel

Reputation: 551

Regular Expression Capturing Group Issue

I want to parse all the link tags from an HTML file, so I have written the following regular expression.

var pattern = @"<(LINK).*?HREF=(""|')?(?<URL>.*?)(""|')?.*?>";
var regExOptions = RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.Multiline;

var linkRegEx = new Regex(pattern , regExOptions );

foreach (Match match in linkRegEx.Matches(htmlFile))
{
    var group = match.Groups["URL"];
    var url = group.Value;
} 

The regex finds matches in the HTML file, but the URL capturing group is always empty.

Upvotes: 0

Views: 93

Answers (1)

p.s.w.g

Reputation: 149020

You could try a pattern like this:

var pattern = @"<(LINK).*?HREF=(?:([""'])(?<URL>.*?)\2|(?<URL>[^\s>]*)).*?>";

This will match:

  • a literal <
  • a literal LINK, captured in group 1
  • zero or more of any character, non-greedily
  • a literal HREF=
  • either of the following
    • a single " or ', captured in group 2
    • zero or more of any character, non-greedily, captured in group URL.
    • whatever was matched in group 2 (the \2 is a back-reference)
      or
    • zero or more of any character except whitespace or >, greedily, captured in group URL.
  • zero or more of any character, non-greedily
  • a literal >
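
To make the group numbering above concrete, here is a minimal sketch that prints each group for one input. The class name and the sample tag are assumptions made for illustration; the pattern is the one given above. In .NET, unnamed groups are numbered before named ones, so the quote character is group 2 and \2 refers to it.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; only the pattern comes from the answer.
public static class LinkGroupDemo
{
    public const string Pattern =
        @"<(LINK).*?HREF=(?:([""'])(?<URL>.*?)\2|(?<URL>[^\s>]*)).*?>";

    public static void Main()
    {
        var regex = new Regex(Pattern, RegexOptions.IgnoreCase);

        // Assumed sample input with another attribute before href.
        var match = regex.Match("<link rel='stylesheet' href=\"site.css\">");

        Console.WriteLine(match.Groups[1].Value);     // the tag name as written in the input
        Console.WriteLine(match.Groups[2].Value);     // the opening quote character
        Console.WriteLine(match.Groups["URL"].Value); // the text between the quotes
    }
}
```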

This will correctly handle inputs like:

  • <LINK HREF="Foo"> produces url = "Foo"
  • <LINK HREF='Bar'> produces url = "Bar"
  • <LINK HREF=Baz> produces url = "Baz"
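
As a rough check, the sketch below runs the pattern from the question and the pattern above against those three inputs. The class and method names are hypothetical. The original pattern still matches each tag, but its lazy (?<URL>.*?) can succeed with zero characters: the optional quote after it also matches nothing, and the trailing .*? then consumes the rest of the tag up to >. Requiring the closing quote via \2 (or a greedy run of non-space characters in the unquoted case) forces the URL group to capture the value.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class for comparing the two patterns.
public static class LinkPatternComparison
{
    // Pattern from the question.
    public const string Original =
        @"<(LINK).*?HREF=(""|')?(?<URL>.*?)(""|')?.*?>";

    // Pattern from the answer.
    public const string Revised =
        @"<(LINK).*?HREF=(?:([""'])(?<URL>.*?)\2|(?<URL>[^\s>]*)).*?>";

    // Returns the URL group of the first match, or null if nothing matched.
    public static string ExtractUrl(string pattern, string html)
    {
        var match = Regex.Match(html, pattern, RegexOptions.IgnoreCase);
        return match.Success ? match.Groups["URL"].Value : null;
    }

    public static void Main()
    {
        string[] samples =
        {
            "<LINK HREF=\"Foo\">",
            "<LINK HREF='Bar'>",
            "<LINK HREF=Baz>",
        };

        foreach (var html in samples)
        {
            // The original pattern yields an empty URL for every sample.
            Console.WriteLine(
                $"{html} original=[{ExtractUrl(Original, html)}] revised=[{ExtractUrl(Revised, html)}]");
        }
    }
}
```

Neither pattern is a substitute for an HTML parser; for example, the unquoted branch also accepts an empty value.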

Upvotes: 1
