Reputation: 945
I have a bunch of HTML and I need to get all the anchors and their values using a regular expression.
This is the sample HTML that I need to process:
<P align=center><SPAN style="FONT-FAMILY: Arial; FONT-SIZE: 10px"><SPAN style="COLOR: #666666">View the </SPAN><A href="http://www.google.com"><SPAN style="COLOR: #666666">online version</SPAN></A><SPAN style="COLOR: #666666"> if you are having trouble <A name=hi>displaying </A>this <a name="msg">message</A></SPAN></SPAN></P>
So, I need to be able to get all <A name="blah"> anchors.
Any help is greatly appreciated.
Upvotes: 1
Views: 2460
Reputation: 10293
As hundreds of other answers on Stack Overflow suggest, it's a bad idea to use regex for processing HTML. Use an HTML parser instead.
But if you still need a regex to find the href URLs, here is one you can use to match double-quoted href attributes and extract their values:
\b(?<=(href="))[^"]*?(?=")
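As a minimal sketch of how that pattern could be applied in C#: the class name `HrefRegexDemo` and the helper `ExtractHrefs` are hypothetical names chosen for illustration, and the input is a shortened piece of the sample HTML from the question. The pattern is used unchanged, so it only sees lowercase `href` with double-quoted values; passing `RegexOptions.IgnoreCase` would be one way to relax the first restriction.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HrefRegexDemo
{
    // The pattern from the answer, unchanged. In a verbatim string, "" is one quote.
    private static readonly Regex HrefPattern =
        new Regex(@"\b(?<=(href=""))[^""]*?(?="")");

    // Hypothetical helper: returns every double-quoted href value in the input.
    public static List<string> ExtractHrefs(string html)
    {
        var result = new List<string>();
        foreach (Match m in HrefPattern.Matches(html))
        {
            result.Add(m.Value);
        }
        return result;
    }

    public static void Main()
    {
        // Shortened version of the sample HTML from the question.
        string html = "<A href=\"http://www.google.com\">online version</A> " +
                      "<A name=hi>displaying </A>";

        foreach (string href in ExtractHrefs(html))
        {
            Console.WriteLine(href); // prints http://www.google.com
        }
    }
}
```

The lookbehind and lookahead keep `href="` and the closing quote out of the match, so `m.Value` is the bare URL and no capture group needs to be read.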
If you want to get the contents between <A> and </A>, then regex is a really bad approach, because lookbehind in many regex engines does not support variable-length patterns.
Upvotes: 2
Reputation: 100248
Don't forget to add a reference to Microsoft.mshtml.dll
using System;
using System.IO;
using System.Linq;
using System.Windows.Forms;

namespace WindowsFormsApplication1
{
    public partial class Form1 : Form
    {
        public Form1()
        {
            InitializeComponent();
            string html = "<P align=center><SPAN style=\"FONT-FAMILY: Arial; FONT-SIZE: 10px\"><SPAN style=\"COLOR: #666666\">View the </SPAN><A href=\"http://www.google.com\"><SPAN style=\"COLOR: #666666\">online version</SPAN></A><SPAN style=\"COLOR: #666666\"> if you are having trouble <A name=hi>displaying </A>this <a name=\"msg\">message</A></SPAN></SPAN></P>";

            // Path.GetTempFileName() already returns a full path, so Path.Combine is not needed.
            string fileName = Path.GetTempFileName();
            File.WriteAllText(fileName, html);

            var browser = new WebBrowser();
            // DocumentCompleted fires after the document has finished loading,
            // so the DOM can be queried safely in the handler.
            browser.DocumentCompleted += browser_DocumentCompleted;
            browser.Navigate(new Uri(fileName));
        }

        private void browser_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
        {
            var browser = (WebBrowser)sender;
            // Document.Links covers hyperlinks, so anchors that only have a name are not included.
            var links = browser
                .Document
                .Links
                .OfType<HtmlElement>()
                .Select(l => ((mshtml.HTMLAnchorElement)l.DomElement).href);
            // result: { "http://www.google.com", .. }
        }
    }
}
Upvotes: -1
Reputation: 17931
The pattern is <a.*?(?<attribute>href|name)="(?<value>.*?)".*?>
so your C# code will be:
Regex expression = new Regex("<a.*?(?<attribute>href|name)=\"(?<value>.*?)\".*?>", RegexOptions.IgnoreCase);
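To show how the named groups could be read back, here is a small sketch that runs the same expression over the sample HTML from the question. The class name `AnchorAttributeDemo` and the helper `ExtractAttributes` are hypothetical. Note one limitation that follows from the pattern itself: it requires a double-quoted value, so the unquoted `<A name=hi>` in the sample is not reported as its own anchor.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class AnchorAttributeDemo
{
    // Same expression as in the answer, with the same options.
    private static readonly Regex Expression = new Regex(
        "<a.*?(?<attribute>href|name)=\"(?<value>.*?)\".*?>",
        RegexOptions.IgnoreCase);

    // Hypothetical helper: returns (attribute, value) pairs, one per regex match.
    public static List<KeyValuePair<string, string>> ExtractAttributes(string html)
    {
        var result = new List<KeyValuePair<string, string>>();
        foreach (Match m in Expression.Matches(html))
        {
            result.Add(new KeyValuePair<string, string>(
                m.Groups["attribute"].Value,
                m.Groups["value"].Value));
        }
        return result;
    }

    public static void Main()
    {
        // The sample HTML from the question.
        string html = "<P align=center><SPAN style=\"FONT-FAMILY: Arial; FONT-SIZE: 10px\"><SPAN style=\"COLOR: #666666\">View the </SPAN><A href=\"http://www.google.com\"><SPAN style=\"COLOR: #666666\">online version</SPAN></A><SPAN style=\"COLOR: #666666\"> if you are having trouble <A name=hi>displaying </A>this <a name=\"msg\">message</A></SPAN></SPAN></P>";

        foreach (var pair in ExtractAttributes(html))
        {
            Console.WriteLine(pair.Key + " = " + pair.Value);
        }
        // prints:
        // href = http://www.google.com
        // name = msg
    }
}
```

The second match actually begins at `<A name=hi>` and runs through to `<a name="msg">`, because the lazy `.*?` keeps expanding until it finds a quoted attribute. That is the kind of surprise an HTML parser avoids.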
Upvotes: 0