Reputation: 3
I need to use a regex to search through an HTML file and replace href="pagename"
with href="pages/pagename".
The href could also be formatted like HREF = 'pagename'.
I do not want to replace any hrefs (in upper or lower case) whose value begins with http, ftp, mailto, javascript, or #.
I am using C# to develop this little app.
Upvotes: 0
Views: 93
Reputation: 7026
There are lots of caveats when using find/replace on HTML and XML. The problem is that many variations of syntax are permitted (and many that are not permitted still work!).
But, you seem to want something like this:
Search for:
([Hh][Rr][Ee][Ff]\s*=\s*['"])(\w+)(['"])
This means:
[Hh]: any one of the items in square brackets (and likewise for the other letters), followed by
\s*: any number of whitespace characters (maybe zero),
=: a literal equals sign,
\s*: any more whitespace,
['"]: either quote type,
\w+: a word (without any slashes or dots; if you want to include .html then use [.\w]+ instead),
['"]: another quote of either kind.
Replace with:
$1pages/$2$3
Which means the contents of the first set of brackets, then pages/, then the contents of the second and third sets of brackets.
Because \w cannot match ':', '/' or '#', values such as http://..., mailto:... and #top do not match and are left alone.
In C# you will need to put the search pattern in a verbatim string (@"...") and escape each double quote inside it as "".
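As a rough sketch of how the pattern and replacement above might be wired together in C#: the class and method names (HrefPrefixer, PrefixHrefs) and the sample HTML are made up for illustration, and the [.\w]+ variant is used so that names like about.html are picked up.

```csharp
using System;
using System.Text.RegularExpressions;

static class HrefPrefixer
{
    // The search pattern as a verbatim string: inside @"...", "" is one literal double quote.
    const string Pattern = @"([Hh][Rr][Ee][Ff]\s*=\s*['""])([.\w]+)(['""])";

    // PrefixHrefs is a hypothetical helper name, not part of any library.
    public static string PrefixHrefs(string html)
    {
        // $1 = href plus opening quote, $2 = page name, $3 = closing quote.
        return Regex.Replace(html, Pattern, "$1pages/$2$3");
    }

    static void Main()
    {
        string html = "<a href=\"about.html\">About</a> <a HREF = 'contact'>Contact</a> " +
                      "<a href=\"http://example.com\">External</a>";
        // Only the first two links are rewritten; the http link does not match.
        Console.WriteLine(PrefixHrefs(html));
    }
}
```

Page names containing characters outside [.\w] (a hyphen, for instance) would be skipped by this pattern, so the character class may need widening for real files.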
Note that it won't do anything even vaguely intelligent, like making sure the quotes match. Warning: try never to use the "any character" symbol (.) in this kind of regex, as it will grab large sections of text, running past the next quotation mark, possibly up to the end of the line (or of the file, if the dot is allowed to match newlines)!
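To illustrate the warning, here is a small sketch comparing a greedy dot with a negated character class. The class name GreedyDotDemo, the helper FirstMatch, and the sample string are assumptions for the demo. By default the .NET dot stops at a newline; with RegexOptions.Singleline it would run across lines as well.

```csharp
using System;
using System.Text.RegularExpressions;

static class GreedyDotDemo
{
    // Hypothetical helper: returns the text of the first match, or "" if none.
    public static string FirstMatch(string pattern, string input)
    {
        return Regex.Match(input, pattern).Value;
    }

    static void Main()
    {
        string html = "<a href=\"one\">1</a> <a href=\"two\">2</a>";

        // Greedy .+ runs on to the LAST double quote on the line.
        Console.WriteLine(FirstMatch(@"href=""(.+)""", html));
        // prints: href="one">1</a> <a href="two"

        // A negated class stops at the first closing quote.
        Console.WriteLine(FirstMatch(@"href=""([^""]+)""", html));
        // prints: href="one"
    }
}
```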
See a regex tutorial for more info, e.g. http://www.regular-expressions.info/dotnet.html
Upvotes: 0
Reputation: 100381
I have not tested this against many cases, but it worked for this one:
var str = "href='page' href = 'www.goo' href='http://' href='ftp://'";
Console.WriteLine(Regex.Replace(str, @"href ?= ?(('|"")([a-z0-9_#.-]+)('|""))", "x", RegexOptions.IgnoreCase));
Result:
"x x href='http://' href='ftp://'"
You'd better keep backup files before running this :P
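The snippet above replaces each match with "x" just to show what gets matched. A possible variation that performs the actual rewrite is sketched below. It is not the answerer's code: the names HrefRewriter and AddPagesPrefix, the named groups, and the explicit scheme list in the lookahead are all assumptions chosen to follow the question's requirements.

```csharp
using System;
using System.Text.RegularExpressions;

static class HrefRewriter
{
    // (?<q>['"]) remembers the opening quote and \k<q> demands the same one to close.
    // The negative lookahead rejects values that start with an excluded prefix.
    const string Pattern =
        @"(?<head>href\s*=\s*(?<q>['""]))" +
        @"(?!https?:|ftp:|mailto:|javascript:|#)" +
        @"(?<page>[^'""]+)\k<q>";

    // AddPagesPrefix is a hypothetical helper name.
    public static string AddPagesPrefix(string html)
    {
        // IgnoreCase covers HREF as well as HTTP:, MAILTO: and so on.
        return Regex.Replace(html, Pattern, "${head}pages/${page}${q}",
                             RegexOptions.IgnoreCase);
    }

    static void Main()
    {
        var str = "href='page' href = 'www.goo' href='http://' href='ftp://'";
        Console.WriteLine(AddPagesPrefix(str));
        // prints: href='pages/page' href = 'pages/www.goo' href='http://' href='ftp://'
    }
}
```

This sketch does not try to recognise other absolute forms such as values starting with / or //, and running it twice would add the prefix twice.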
Upvotes: 0
Reputation: 78920
HTML manipulation through Regex is not recommended since HTML is not a "regular language." I'd highly recommend using the HTML Agility Pack instead. That gives you a DOM interface for HTML.
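A minimal sketch of what that could look like, assuming the HtmlAgilityPack NuGet package is referenced. The class name AgilityPackRewriter, the method PrefixHrefs, and the list of skipped prefixes are illustrative choices taken from the question, not something the library provides.

```csharp
using System;
using System.Text.RegularExpressions;
using HtmlAgilityPack; // assumes the HtmlAgilityPack NuGet package is installed

static class AgilityPackRewriter
{
    // Prefixes the question says must be left alone (case-insensitive).
    static readonly Regex SkipPrefix =
        new Regex(@"^(https?:|ftp:|mailto:|javascript:|#)", RegexOptions.IgnoreCase);

    // PrefixHrefs is a hypothetical helper name.
    public static string PrefixHrefs(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var nodes = doc.DocumentNode.SelectNodes("//*[@href]");
        if (nodes == null) return html;

        foreach (HtmlNode node in nodes)
        {
            string href = node.GetAttributeValue("href", "").Trim();
            if (href.Length == 0 || SkipPrefix.IsMatch(href)) continue;
            node.SetAttributeValue("href", "pages/" + href);
        }

        // Serialise the modified DOM back to a string.
        return doc.DocumentNode.OuterHtml;
    }
}
```

The parser deals with quoting and spacing around the attribute, so the only pattern left is the simple prefix check on the attribute value. The serialised output may normalise details such as quote style, so it is worth diffing a few files before trusting it on a whole site.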
Upvotes: 3