Reputation: 30328
I've always been interested in writing web software like forums or blogs, things which take a limited markup and rewrite it into HTML. But lately, I've noticed more and more that for PHP, try googling "PHP BBCode parser -PEAR" and test a few out, you either get an inefficient mess, or you get poor code with XSS holes here and there.
Taking my previously mentioned example of the poor BBCode parsers out there, how would you avoid XSS? I'll now take your typical regular expression for handling a link, and you can mention how vulnerable it is and how to avoid it.
// Assume input has already been encoded by htmlspecialchars with ENT_QUOTES
$text = preg_replace('#\[url\](.*?)\[/url\]#i','<a href="\1">\1</a>', $text);
$text = preg_replace('#\[url=(.*?)\](.*?)\[/url\]#i','<a href="\1">\2</a>', $text);
Handling of image tags is hardly more secure than this.
So I have several specific questions, mostly specific to PHP implementations.
Is it better practice, in this example, to only match using a uri/url validation expression? Or, is it better to use (.*?) and a callback, then ascertain whether or not the input is a valid link? As would be obvious above, the javascript:alert('XSS!') would work in the above URL tags, but would fail if the uri-matching was done.
What about functions like urlencode() within a callback, would they be any deterrence or problem (as far as URI standards go)?
Would it be safer to write a full-stack parser?
I know my example is one of many, and is more specific than some. However, don't shirk from providing your own. So, I'm looking for principles and best practices, and general recommendations for XSS-protection in a text-parsing situation.
Upvotes: 1
Views: 1231
Reputation: 536539
test a few out, you either get an inefficient mess, or you get poor code with XSS holes
Hell yeah. I've not met a bbcode implementation yet that wasn't XSS-vulnerable.
'<a href="\1">\1</a>'
No good: fails to HTML-escape ‘<’, ‘&’ and ‘"’ characters.
Is it better practice, in this example, to only match using a uri/url validation expression? Or, is it better to use (.*?) and a callback, then ascertain whether or not the input is a valid link?
I would take the callback. You need the callback anyway to do the HTML-escaping; it's not possible to be secure with only simple string replacement. Drop the sanitisation in whilst you're doing it.
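As a rough illustration of the callback approach, something like the following could work. It is only a sketch: the function names render_url_tag and render_bbcode_urls are made up for this example, and the "only absolute http/https URLs" policy is an assumption, not something the question specifies. The whole text is escaped first, the captured URL is decoded for checking, and it is escaped again on the way out.

```php
<?php
// Hypothetical helper: turns one [url]...[/url] match into an anchor,
// or returns the matched text unchanged if the URL is not acceptable.
function render_url_tag(array $match): string
{
    // The text was escaped up front, so decode the capture to see the real URL.
    $url = htmlspecialchars_decode($match[1], ENT_QUOTES);

    // Reject whitespace and control characters, which can disguise a scheme.
    if (preg_match('/[\x00-\x20\x7f]/', $url)) {
        return $match[0];
    }

    // Assumed policy: allow only absolute http(s) URLs.
    $scheme = parse_url($url, PHP_URL_SCHEME);
    if (!is_string($scheme) || !in_array(strtolower($scheme), ['http', 'https'], true)) {
        return $match[0];
    }

    // Escape again, both for the attribute value and for the link text.
    $safe = htmlspecialchars($url, ENT_QUOTES, 'UTF-8');
    return '<a href="' . $safe . '">' . $safe . '</a>';
}

// Hypothetical entry point: escape everything, then expand [url] tags.
function render_bbcode_urls(string $raw): string
{
    // Escaping the whole input first keeps text outside the tags safe too.
    $text = htmlspecialchars($raw, ENT_QUOTES, 'UTF-8');
    return preg_replace_callback('#\[url\](.*?)\[/url\]#i', 'render_url_tag', $text);
}
```

With this, a javascript: URL never becomes a link; the tag is simply left in the output as escaped text.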
What about functions like urlencode() within a callback
Nearly; actually you need htmlspecialchars(). urlencode() is about encoding query parameters, which isn't what you need here.
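To make the difference concrete, here is a small comparison on a made-up URL (the URL and variable names are just for illustration). urlencode() treats the whole string as a single query value and destroys the URL's structure, while htmlspecialchars() leaves the URL intact and only neutralises the characters that matter inside an HTML attribute.

```php
<?php
// Example URL, chosen only to contain '&' and '"'.
$url = 'http://www.example.com/?q=a&b="c"';

// Encodes ':', '/', '?', '=' and '&' as well, so this is no longer a usable link.
$asQueryValue = urlencode($url);
// http%3A%2F%2Fwww.example.com%2F%3Fq%3Da%26b%3D%22c%22

// Only the HTML metacharacters are replaced; safe to drop into href="...".
$asAttribute = htmlspecialchars($url, ENT_QUOTES, 'UTF-8');
// http://www.example.com/?q=a&amp;b=&quot;c&quot;
```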
Would it be safer to write a full-stack parser?
Yes.
bbcode is not really amenable to regex parsing, because it's a recursive tag-based language (like XML, which regex also cannot parse). Many bbcode holes are caused by nesting and misnesting problems. For example:
[url]http://www.example.com/[i][/url]foo[/i]
Could come out as something like
<a href="http://www.example.com/<i>">foo</i>
There are many other traps that generate broken code (up to and including XSS holes) on various bbcode implementations.
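For comparison, a parser that tracks open tags on a stack can refuse misnested closers outright. The following is a deliberately tiny sketch, not a complete bbcode parser: the function name render_simple_bbcode is invented here, and it assumes only [b] and [i] are supported, with everything else (unknown tags, stray closers, plain text) emitted as escaped text.

```php
<?php
// Hypothetical sketch: supports only [b] and [i]; anything else is literal text.
function render_simple_bbcode(string $raw): string
{
    $allowed = ['b' => 'b', 'i' => 'i']; // bbcode tag => HTML element
    $stack = [];                          // currently open tags, innermost last
    $out = '';

    // Split into tag tokens and text tokens, keeping the tags themselves.
    $tokens = preg_split(
        '#(\[/?[a-z]+\])#i',
        $raw,
        -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
    );

    foreach ($tokens as $token) {
        if (preg_match('#^\[(/?)([a-z]+)\]$#i', $token, $m)) {
            $name = strtolower($m[2]);
            if (isset($allowed[$name])) {
                if ($m[1] === '') {
                    // Opening tag: remember it and emit the element.
                    $stack[] = $name;
                    $out .= '<' . $allowed[$name] . '>';
                    continue;
                }
                if ($stack !== [] && end($stack) === $name) {
                    // Closing tag that matches the innermost open tag.
                    array_pop($stack);
                    $out .= '</' . $allowed[$name] . '>';
                    continue;
                }
            }
        }
        // Plain text, unknown tags and misnested closers end up here.
        $out .= htmlspecialchars($token, ENT_QUOTES, 'UTF-8');
    }

    // Close anything the author left open, innermost first.
    while ($stack !== []) {
        $out .= '</' . $allowed[array_pop($stack)] . '>';
    }
    return $out;
}
```

Because a closer is only honoured when it matches the top of the stack, the output stays well-formed whatever the input nesting looks like.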
I'm looking for principles and best practices
If you need a bbcode-like language that you can regex, you need to:
It's still damned hard to get right. A proper parser is much more likely to be watertight.
Upvotes: 4