Reputation: 188
I have a block of HTML that looks something like this:
<p><a href="docs/123.pdf">33</a></p>
There are hundreds of anchor links like this, and I need to replace each href based on its anchor text. For example, I need to replace the link above with something like:
<a href="33.html">33</a>.
I need to take the value 33 and look it up in my database to find the new link to use as the href.
Everything else in the original HTML must stay exactly as above.
How can I do this?
Upvotes: 2
Views: 3392
Reputation: 18664
Although this doesn't answer your question, the HTML Agility Pack is a great tool for manipulating and working with HTML: http://html-agility-pack.net
It could at least make grabbing the values you need and doing the replaces a little easier.
The related question "How to use HTML Agility pack" contains links on how to use it.
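For illustration, here is a minimal sketch of how the replacement might look with the HTML Agility Pack. It assumes the HtmlAgilityPack NuGet package is referenced; the Links dictionary is a hypothetical stand-in for the database lookup, and the class and method names are made up for this example.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // third-party NuGet package, assumed to be installed

static class AgilityLinkRewriter
{
    // Hypothetical stand-in for the database: anchor text -> new href.
    static readonly Dictionary<string, string> Links = new Dictionary<string, string>
    {
        { "33", "33.html" }
    };

    public static string Rewrite(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty list) when nothing matches.
        var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
            return html;

        foreach (HtmlNode a in anchors)
        {
            string text = a.InnerText.Trim();
            // Only touch links whose anchor text has a known replacement.
            if (Links.TryGetValue(text, out string newHref))
                a.SetAttributeValue("href", newHref);
        }
        return doc.DocumentNode.OuterHtml;
    }

    static void Main()
    {
        Console.WriteLine(Rewrite("<p><a href=\"docs/123.pdf\">33</a></p>"));
    }
}
```

Because the parser tolerates sloppy markup, this should cope with pages that are not well-formed XML, which is the main reason to prefer it over a plain XML parser.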
Upvotes: 5
Reputation: 75272
So, what you want to do is generate the replacement string based on the contents of the match. Consider using one of the Regex.Replace overloads that take a MatchEvaluator. Example:
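using System;
using System.Text.RegularExpressions;

static class Program
{
    static void Main()
    {
        // Group 1 captures the anchor text.
        Regex r = new Regex(@"<a href=""[^""]+"">([^<]+)");
        string s0 = @"<p><a href=""docs/123.pdf"">33</a></p>";
        string s1 = r.Replace(s0, m => GetNewLink(m));
        Console.WriteLine(s1);
    }

    // Builds the replacement text; the database lookup would go here.
    static string GetNewLink(Match m)
    {
        return string.Format(@"<a href=""{0}.html"">{0}", m.Groups[1].Value);
    }
}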
I've actually taken it a step further and used a lambda expression instead of explicitly creating a delegate method.
Upvotes: 0
Reputation: 74385
Slurp your HTML into an XmlDocument (your markup is valid, isn't it?). Then use XPath to find all the <a> tags with an href attribute. Apply the transform and assign the new value to the href attribute. Then write the XmlDocument out.
Easy!
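A rough sketch of those steps, assuming the markup really is well-formed XML (LoadXml throws an XmlException otherwise, for example on an undeclared entity such as &nbsp;). The Links dictionary is a hypothetical stand-in for the database lookup, and the class and method names are invented for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

static class XmlLinkRewriter
{
    // Hypothetical stand-in for the database: anchor text -> new href.
    static readonly Dictionary<string, string> Links = new Dictionary<string, string>
    {
        { "33", "33.html" }
    };

    public static string Rewrite(string xhtml)
    {
        var doc = new XmlDocument();
        doc.PreserveWhitespace = true; // keep the original whitespace intact
        doc.LoadXml(xhtml);            // requires well-formed markup

        // XPath: every <a> element that has an href attribute.
        foreach (XmlElement a in doc.SelectNodes("//a[@href]"))
        {
            string text = a.InnerText.Trim();
            // Leave the link alone if the lookup has no entry for it.
            if (Links.TryGetValue(text, out string newHref))
                a.SetAttribute("href", newHref);
        }
        return doc.OuterXml;
    }

    static void Main()
    {
        // Prints: <p><a href="33.html">33</a></p>
        Console.WriteLine(Rewrite("<p><a href=\"docs/123.pdf\">33</a></p>"));
    }
}
```

Note that the input needs a single root element, so a fragment with several top-level <p> tags would have to be wrapped in a container element first.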
Upvotes: 1
Reputation: 27953
Consider using the following rough algorithm.
using System;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

static class Program
{
    static void Main ()
    {
        string html = "<p><a href=\"docs/123.pdf\">33</a></p>"; // read the whole html file into this string.
        StringBuilder newHtml = new StringBuilder (html);
        Regex r = new Regex (@"\<a href=\""([^\""]+)\"">([^<]+)"); // 1st capture for the replacement and 2nd for the find
        foreach (var match in r.Matches(html).Cast<Match>().OrderByDescending(m => m.Index))
        {
            string text = match.Groups[2].Value;
            string newHref = DBTranslate (text);
            newHtml.Remove (match.Groups[1].Index, match.Groups[1].Length);
            newHtml.Insert (match.Groups[1].Index, newHref);
        }
        Console.WriteLine (newHtml);
    }

    static string DBTranslate(string s)
    {
        return "junk_" + s;
    }
}
(The OrderByDescending makes sure the indexes don't change as you modify the StringBuilder.)
Upvotes: 1
Reputation: 31609
Use a regexp to find the values and replace them.
A regexp like <p><a href="[^"]+">([^<]+)</a></p> will match each link and capture the anchor text.
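As a sketch of how that could look in C#, using the pattern <p><a href="[^"]+">([^<]+)</a></p> with Regex.Replace and a MatchEvaluator lambda. The Links dictionary is a hypothetical stand-in for the database lookup, and the class and method names are made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class RegexLinkRewriter
{
    // Hypothetical stand-in for the database: anchor text -> new href.
    static readonly Dictionary<string, string> Links = new Dictionary<string, string>
    {
        { "33", "33.html" }
    };

    // Group 1 captures the anchor text between <a ...> and </a>.
    static readonly Regex AnchorPattern =
        new Regex(@"<p><a href=""[^""]+"">([^<]+)</a></p>");

    public static string Rewrite(string html)
    {
        return AnchorPattern.Replace(html, m =>
        {
            string text = m.Groups[1].Value;
            return Links.TryGetValue(text, out string newHref)
                ? "<p><a href=\"" + newHref + "\">" + text + "</a></p>"
                : m.Value; // unknown anchor text: leave the link untouched
        });
    }

    static void Main()
    {
        // Prints: <p><a href="33.html">33</a></p>
        Console.WriteLine(Rewrite("<p><a href=\"docs/123.pdf\">33</a></p>"));
    }
}
```

The pattern is strict about the exact markup, so links with extra attributes, single quotes, or different whitespace will simply not match and will be left as they are.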
Upvotes: 0