Seb Nilsson

Reputation: 26408

C# Regex: Getting URL and text from multiple "a href"-tags

I want to be able to scrape a webpage containing multiple "<a href"-tags and return a structured collection of them.

<div>
    <p>Lorem ipsum... <a href="https://stackoverflow">Classic link</a>
        <a title="test" href=http://sloppy-html-5-href.com>I lovez HTML 5</a>
    </p>
    <a class="abc" href='/my-tribute-to-javascript.html'>I also love JS</a>
    <iframe width="420" height="315" src="http://www.youtube.com/embed/JVPT4h_ilOU"
        frameborder="0" allowfullscreen></iframe><!-- Don't catch me! -->
</div>

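So I want these values:

https://stackoverflow : Classic link
http://sloppy-html-5-href.com : I lovez HTML 5
/my-tribute-to-javascript.html : I also love JS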

As you can see, only "a href" tags should be caught, returning both the link and the content between the tags. It should support every form of href that is valid in HTML 5 (double-quoted, single-quoted, or unquoted). The href attribute can be surrounded by any other attributes.

So I basically want a regex to fill in the following code:

public IEnumerable<Tuple<string, string>> GetLinks(string html) {
     string pattern = string.Empty; // TODO: Get solution from Stackoverflow
     var matches = Regex.Matches(html, pattern);

     foreach(Match match in matches) {
         // Groups[0] is the entire match; the captured values start at index 1
         yield return new Tuple<string, string>(
             match.Groups[1].Value, match.Groups[2].Value);
     }
}

Upvotes: 3

Views: 9282

Answers (2)

pierroz

Reputation: 7870

I've always read that parsing HTML with regular expressions is evil. OK, that's surely true...
But like evil, regexes are so much fun :)
So I'd give this one a try:

Regex r = new Regex(@"<a.*?href=(""|')(?<href>.*?)(""|').*?>(?<value>.*?)</a>");

foreach (Match match in r.Matches(html))
    yield return new Tuple<string, string>(
        match.Groups["href"].Value, match.Groups["value"].Value);
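Note that the pattern above only matches quoted href values, so the unquoted link in the question's sample would be skipped. Here is a rough sketch of a more tolerant variant, not a definitive solution: the class name LinkScraper and the field name AnchorPattern are made up for illustration, and it assumes that reusing the group name "href" across the three alternatives is acceptable (.NET allows duplicate group names). It can still be fooled by text that looks like " href=" inside another attribute's value, and by nested or malformed tags.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper class, named for illustration only.
public static class LinkScraper
{
    // The href value may be double-quoted, single-quoted, or unquoted.
    // Singleline lets the link text span several lines; IgnoreCase accepts <A HREF=...>.
    private static readonly Regex AnchorPattern = new Regex(
        @"<a\b[^>]*?\shref\s*=\s*(?:""(?<href>[^""]*)""|'(?<href>[^']*)'|(?<href>[^\s""'>]+))[^>]*>(?<value>.*?)</a>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    public static IEnumerable<Tuple<string, string>> GetLinks(string html)
    {
        foreach (Match match in AnchorPattern.Matches(html))
        {
            // Only one of the three "href" alternatives takes part in a given match.
            yield return Tuple.Create(
                match.Groups["href"].Value, match.Groups["value"].Value);
        }
    }
}
```

Calling LinkScraper.GetLinks(html) on the sample from the question should yield the three anchors and leave the iframe alone, since "<a\b" does not match "<iframe".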

Upvotes: 4

WKordos

Reputation: 2245

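Isn't it easier to use Html Agility Pack and XPath than a regex?

It would be something like:

var webGet = new HtmlWeb();
var document = webGet.Load(url);
// SelectNodes takes an XPath expression and returns null when nothing matches
var aNodeCollection = document.DocumentNode.SelectNodes("//a[@href]");

foreach (HtmlNode node in aNodeCollection)
{
    var href = node.Attributes["href"].Value; // the link target
    var text = node.InnerText;                // the text between the tags
}

It's only a rough sketch.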

Upvotes: 3
