Reputation: 26408
I want to be able to scrape a webpage containing multiple <a href> tags and return a structured collection of them.
<div>
<p>Lorem ipsum... <a href="https://stackoverflow">Classic link</a>
<a title="test" href=http://sloppy-html-5-href.com>I lovez HTML 5</a>
</p>
<a class="abc" href='/my-tribute-to-javascript.html'>I also love JS</a>
<iframe width="420" height="315" src="http://www.youtube.com/embed/JVPT4h_ilOU"
frameborder="0" allowfullscreen></iframe><!-- Don't catch me! -->
</div>
So I want these values:
As you can see, only the values from <a href> tags should be captured: both the link and the content between the tags. It should support every href form that is valid in HTML 5, and the href attribute may be surrounded by any other attributes.
So I basically want a regex to fill in the following code:
public IEnumerable<Tuple<string, string>> GetLinks(string html) {
    string pattern = string.Empty; // TODO: Get solution from Stackoverflow
    var matches = Regex.Matches(html, pattern);
    foreach (Match match in matches) {
        // Groups[0] is the whole match, so the two captures are Groups[1] and Groups[2]
        yield return new Tuple<string, string>(
            match.Groups[1].Value, match.Groups[2].Value);
    }
}
Upvotes: 3
Views: 9282
Reputation: 7870
I've always read that parsing HTML with regular expressions is evil. OK... it's surely true...
But like evil, regexes are so much fun :)
So I'd give this one a try:
Regex r = new Regex(@"<a.*?href=(""|')(?<href>.*?)(""|').*?>(?<value>.*?)</a>");
foreach (Match match in r.Matches(html))
    yield return new Tuple<string, string>(
        match.Groups["href"].Value, match.Groups["value"].Value);
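One caveat: the pattern above only matches quoted href values, so the unquoted HTML 5 link in the question's sample would be skipped, and `.` does not cross line breaks by default. A possible variant, offered only as a sketch, adds a third alternative for unquoted values and keeps the attribute scan inside a single tag. The class name `LinkExtractor` is hypothetical, and this is still a regex rather than a real HTML parser, so unusual markup can defeat it.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper class; the name is not from the original answer.
public static class LinkExtractor
{
    // Three alternatives for the href value: double-quoted, single-quoted, unquoted.
    // .NET allows the same group name to be reused across alternatives.
    // [^>]* keeps the attribute scan inside one tag; Singleline lets the link text span lines.
    private static readonly Regex AnchorPattern = new Regex(
        @"<a\b[^>]*?\shref\s*=\s*(?:""(?<href>[^""]*)""|'(?<href>[^']*)'|(?<href>[^\s""'>]+))[^>]*>(?<value>.*?)</a\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    public static IEnumerable<Tuple<string, string>> GetLinks(string html)
    {
        foreach (Match match in AnchorPattern.Matches(html))
        {
            yield return Tuple.Create(
                match.Groups["href"].Value, match.Groups["value"].Value);
        }
    }
}
```

Run against the sample HTML from the question, this should yield the three anchors and skip the iframe, since `<a\b` requires a word boundary right after the `a`.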
Upvotes: 4
Reputation: 2245
Isn't it easier to use Html Agility Pack and XPath than a regex?
It would be something like:
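using HtmlAgilityPack;

var webGet = new HtmlWeb();
var document = webGet.Load(url);
// SelectNodes takes an XPath expression; it returns null when nothing matches
var aNodeCollection = document.DocumentNode.SelectNodes("//a[@href]");
foreach (HtmlNode node in aNodeCollection)
{
    string href = node.Attributes["href"].Value; // the link target
    string text = node.InnerText;                // the content between the tags
}
It's only a sketch, I haven't tested it.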
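To fit the `GetLinks(string html)` signature from the question, the same idea can be applied to an HTML string instead of a URL. The following is a minimal sketch, assuming the HtmlAgilityPack NuGet package is referenced; the class name `AgilityLinkExtractor` is hypothetical.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // assumes the HtmlAgilityPack NuGet package is installed

// Hypothetical helper class; the name is not from the original answer.
public static class AgilityLinkExtractor
{
    public static IEnumerable<Tuple<string, string>> GetLinks(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when no node matches.
        var anchors = document.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
        {
            yield break;
        }

        foreach (HtmlNode node in anchors)
        {
            // GetAttributeValue falls back to the given default if the attribute is missing.
            yield return Tuple.Create(
                node.GetAttributeValue("href", string.Empty), node.InnerText);
        }
    }
}
```

The null check matters because iterating the result of `SelectNodes` directly would throw on a page without any links.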
Upvotes: 3