C# Regex: Getting URL and text from multiple "a href"-tags

Question

I want to be able to scrape a webpage containing multiple "a href"-tags and return a structured collection of them.




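    Lorem ipsum... <a href="https://stackoverflow">Classic link</a>
        <a href="http://sloppy-html-5-href.com">I lovez HTML 5</a>

    <a href="/my-tribute-to-javascript.html">I also love JS</a>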
    



So I want these values:


https://stackoverflow | Classic link
http://sloppy-html-5-href.com | I lovez HTML 5
/my-tribute-to-javascript.html | I also love JS


As you can see, only "a" tags with an href should be matched, capturing both the link and the text between the tags. All href values that are valid in HTML 5 should be supported, and the href attribute may be surrounded by any other attributes.

So I basically want a regex to fill in the following code:

public IEnumerable<Tuple<string, string>> GetLinks(string html) {
     string pattern = string.Empty; // TODO: Get solution from Stackoverflow
     var matches = Regex.Matches(html, pattern);

     foreach(Match match in matches) {
         // Groups[0] is the whole match; the captures start at index 1.
         yield return new Tuple<string, string>(
             match.Groups[1].Value, match.Groups[2].Value);
     }
}

pierroz · Accepted Answer

I've always read that parsing HTML with regular expressions is evil. OK, that's surely true...
But like evil itself, regexes are so much fun :)
So I'd give this one a try:

Regex r = new Regex(@"<a.*?href=(""|')(?<href>.*?)(""|').*?>(?<value>.*?)</a>");

foreach (Match match in r.Matches(html))
    yield return new Tuple<string, string>(
        match.Groups["href"].Value, match.Groups["value"].Value);
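For anyone who wants to drop this into a project, here is a minimal self-contained sketch built around a slightly tightened variation of the pattern. The class name `LinkExtractor`, the backreference for the closing quote, and the `IgnoreCase | Singleline` options are my own assumptions, not part of the original snippet. It still assumes every href value is quoted.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper class; the name is illustrative only.
public static class LinkExtractor
{
    // Variation on the pattern above (assumptions, not the original):
    //  - <a\b       avoids matching other tags that start with "a", e.g. <abbr>
    //  - [^>]*?     keeps the attribute scan inside the opening tag
    //  - ([""'])…\1 requires the closing quote to be the same as the opening one
    // Unquoted href values are NOT matched by this sketch.
    private static readonly Regex AnchorPattern = new Regex(
        @"<a\b[^>]*?\bhref\s*=\s*([""'])(?<href>.*?)\1[^>]*>(?<value>.*?)</a>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns (href, inner text) pairs for every matching anchor tag.
    public static IEnumerable<Tuple<string, string>> GetLinks(string html)
    {
        foreach (Match match in AnchorPattern.Matches(html))
        {
            yield return Tuple.Create(
                match.Groups["href"].Value,
                match.Groups["value"].Value);
        }
    }
}
```

Each returned tuple holds the href in `Item1` and the link text in `Item2`. Since unquoted values such as `href=/page.html` are valid HTML 5 but are skipped here, a real HTML parser remains the more robust choice for arbitrary markup.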
