Reputation: 19158
Hi, I'm having trouble getting my regex to work. I'm working with C# and ASP.NET. I'll post the code I'm using now; what I can't get to work is the second regex, which should get whatever is in the href="LINK".
Thanks in advance.
var textBody = "lorem ipsum... <a href='http://www.link.com'>link</a>";
var urlTagPattern = new Regex(@"<a.*?href=[""'](?<url>.*?)[""'].*?>(?<name>.*?)</a>", RegexOptions.IgnoreCase);
//THIS IS THE REGEX
var hrefPattern = new Regex(@"HREF={:q}\>", RegexOptions.IgnoreCase);
var urls = urlTagPattern.Matches(textBody);
foreach (Match url in urls)
{
    var hrefs = hrefPattern.Match(url.ToString());
    litStatus.Text = hrefs.ToString();
}
Upvotes: 0
Views: 14160
Reputation: 536715
Welcome to your daily installment of Don't Use Regex To Parse HTML. In this edition of Don't Use Regex To Parse HTML, we'll be reminding you not to use regex to parse HTML because HTML cannot reliably be parsed by a regex and dozens of valid HTML constructs will break the naïve regex proposed. We won't be mentioning all the additional invalid ones in common use on the web in Don't Use Regex To Parse HTML today.
Also in Don't Use Regex To Parse HTML, we'll be linking to the Html Agility Pack, a .NET library you can use to parse HTML properly and subsequently extract link URLs reliably in just a couple of lines of code (a very similar example being present on that page).
We hope you have enjoyed today's Don't Use Regex To Parse HTML, and look forward to seeing you again tomorrow for another exciting edition of Don't Use Regex To Parse HTML, when someone posts another question about using regex to parse HTML. But that's all from Don't Use Regex To Parse HTML for now. Bye!
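For readers who want to see roughly what the parser-based route looks like, here is a minimal sketch. It assumes the third-party HtmlAgilityPack NuGet package is referenced; the class and method names (`LinkLister`, `GetHrefs`) are made up for illustration and do not come from the library or from this thread.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

// Hypothetical helper for illustration only.
public static class LinkLister
{
    // Returns the href value of every <a> element that has an href attribute.
    public static List<string> GetHrefs(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var hrefs = new List<string>();

        // SelectNodes returns null, not an empty collection, when nothing matches.
        var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
        {
            return hrefs;
        }

        foreach (HtmlNode anchor in anchors)
        {
            hrefs.Add(anchor.GetAttributeValue("href", string.Empty));
        }
        return hrefs;
    }
}
```

With the sample string from the question, `LinkLister.GetHrefs(textBody)` should give a list containing `http://www.link.com`, regardless of whether the attribute uses single or double quotes.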
Upvotes: 14
Reputation: 26
The following example searches an input string and prints all the href="…" values along with their positions in the string. It constructs a compiled Regex object and then uses a Match object to iterate through all the matches. In the pattern, the metacharacter \s matches any whitespace character, and \S matches any non-whitespace character.
' VB
Imports System
Imports System.Text.RegularExpressions

Sub DumpHrefs(inputString As String)
    Dim r As Regex
    Dim m As Match
    ' Group 1 captures either a double-quoted value or an unquoted run of non-whitespace.
    r = New Regex("href\s*=\s*(?:""(?<1>[^""]*)""|(?<1>\S+))", _
        RegexOptions.IgnoreCase Or RegexOptions.Compiled)
    m = r.Match(inputString)
    While m.Success
        Console.WriteLine("Found href " & m.Groups(1).Value _
            & " at " & m.Groups(1).Index.ToString())
        m = m.NextMatch()
    End While
End Sub
// C#
using System;
using System.Text.RegularExpressions;

void DumpHrefs(String inputString) {
    Regex r;
    Match m;
    // Group 1 captures either a double-quoted value or an unquoted run of non-whitespace.
    r = new Regex("href\\s*=\\s*(?:\"(?<1>[^\"]*)\"|(?<1>\\S+))",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);
    for (m = r.Match(inputString); m.Success; m = m.NextMatch())
    {
        Console.WriteLine("Found href " + m.Groups[1].Value + " at "
            + m.Groups[1].Index);
    }
}
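One caveat: the pattern above only recognizes double-quoted or unquoted values. With a single-quoted attribute like the one in the question's sample string, the \S+ branch would capture the quotes plus everything up to the next whitespace. A minimal sketch of one possible variation that adds a single-quoted branch follows; the names `HrefExtractor` and `ExtractHrefs` are made up for illustration, and the usual limits of matching HTML with regex still apply.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class HrefExtractor
{
    // Three alternatives, tried in order: a "double-quoted" value, a 'single-quoted'
    // value, or an unquoted run that stops at whitespace or '>'.
    // .NET allows the same group name to appear in several alternatives.
    private static readonly Regex HrefPattern = new Regex(
        @"href\s*=\s*(?:""(?<url>[^""]*)""|'(?<url>[^']*)'|(?<url>[^\s>]+))",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Returns every captured href value in the order it appears in the input.
    public static List<string> ExtractHrefs(string input)
    {
        var results = new List<string>();
        foreach (Match m in HrefPattern.Matches(input))
        {
            results.Add(m.Groups["url"].Value);
        }
        return results;
    }
}
```

For the question's `textBody`, `HrefExtractor.ExtractHrefs(textBody)` returns a single entry, `http://www.link.com`, without the surrounding quotes.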
Upvotes: 1