Reputation: 2347
I have written some code to match and parse a Markdown link of this style:
[click to view a flower](http://www.yahoo.com/flower.html)
I have this code that is meant to extract the link text, then the url itself, then stick them in an A HREF link. I am worried though that maybe I am missing a way for someone to inject XSS, because I am leaving in a decent amount of characters. is this safe?
$pattern_square = '\[(.*?)\]';
$pattern_round = "\((.*?)\)";
$pattern = "/".$pattern_square.$pattern_round."/";
preg_match($pattern, $input, $matches);
$words = $matches[1];
$url = $matches[2];
$words = ereg_replace("[^-_@0-9a-zA-Z\.]", "", $words);
$url = ereg_replace("[^-A-Za-z0-9+&@#/%?=~_|!:.]","",$url);
$final = "<a href='$url'>$words</a>";
It seems to work okay, and it does exclude some stupid URLs that include semicolons and backslashes, but I don't care about those URLs.
Upvotes: 5
Views: 2298
Reputation: 50021
If you have already passed the input through htmlspecialchars
(which you are doing, right?) then it is already impossible for the links to contain any characters that could cause XSS.
If you have not already passed the input through htmlspecialchars
, then it doesn't matter what filtering you do when parsing the links, because you're already screwed, because one can trivially include arbitrary HTML or XSS outside the links.
This function will safely parse Markdown links in text while applying htmlspecialchars
on it:
function doMarkdownLinks($s) {
return preg_replace_callback('/\[(.*?)\]\((.*?)\)/', function ($matches) {
return '<a href="' . $matches[2] . '">' . $matches[1] . '</a>';
}, htmlspecialchars($s));
}
If you need to do anything more complicated than that, I advise you to use an existing parser, because it is too easy to make a mistake with this sort of thing.
Upvotes: 4