Reputation: 666
I have a comment form that currently accepts plain text only. This is nice and easy because I can strip anything vaguely resembling HTML with strip_tags and run htmlspecialchars on the rest.
However, there are plans to turn this into an HTML form with support for several tags: <a>, <b>, <p>, <h3>, etc.
The problem I see is that even running strip_tags on the input and allowing only those tags would still leave me open to XSS through JavaScript inside the allowed tags (for example, in their attributes).
I know that HTML Purifier can be used to fix this, but it looks really bulky and slow, and I find it hard to believe that there isn't a better way.
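To make the concern concrete, here is a minimal sketch of what strip_tags does with an allow-list. The sample input is made up for illustration; the point is only that attributes on allowed tags pass through untouched.

```php
<?php
// Hypothetical comment body submitted by an attacker.
$input = '<a href="#" onclick="alert(1)">click</a><script>alert(2)</script>';

// Keep only <a> and <b>; all other tags are removed.
$output = strip_tags($input, '<a><b>');

// The <script> tag is gone, but the onclick handler on the allowed <a> survives.
echo $output, "\n";
```

So the allow-list filters tag names only; it does nothing about event-handler attributes or javascript: URLs on the tags it keeps.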
I had considered using a BBCode editor instead, but assuming that these insert code into the database as BB, how would I then convert it from BB back to HTML in order to display it?
Upvotes: 0
Views: 805
Reputation:
Please think twice about whether you have the time and energy to write your own validation scheme, as this is pretty complex.
For one, take a look at what happens if you do not filter recursively.
Bad word: CopyThis
Input: CopyCopyThisThis
Removing the bad word once leaves you with
CopyThis
which is the bad word again.
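A small sketch of that pitfall, with a repeat-until-stable loop as one possible remedy. The helper name removeBadWord is hypothetical and not part of any library.

```php
<?php
// Hypothetical helper: remove $badWord repeatedly until the text stops changing.
function removeBadWord(string $text, string $badWord): string
{
    do {
        $previous = $text;
        $text = str_replace($badWord, '', $text);
    } while ($text !== $previous);

    return $text;
}

$input = 'CopyCopyThisThis';

// A single pass removes the inner "CopyThis" and thereby creates a new one.
$singlePass = str_replace('CopyThis', '', $input);

// Repeating the pass until nothing changes removes that one as well.
$repeated = removeBadWord($input, 'CopyThis');

var_dump($singlePass, $repeated);
```

This only covers the single case above; a real filter has many more such corner cases, which is why writing your own is risky.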
Upvotes: 1
Reputation: 18861
BBCode is indeed a very easy and useful solution for your problem.
I am using the MarkItUp! editor, but you can surely find many others.
"how would I then convert it from BB back to HTML in order to display it?"
The answer is simple: you can use preg_replace for it.
I implemented this myself a while back, so I can let you use my parser code. It translates the basic tags, as well as some custom ones like [center]. It's really easy to add your own tags or replace the current ones.
The script consists of one large array of regexes and replacements, and a single preg_replace call.
function replaceBBcode($str) {
    $replace = array(
        // inline text formats
        '/\[b\](.*?)\[\/b\]/is' => '<b>$1</b>',
        '/\[i\](.*?)\[\/i\]/is' => '<i>$1</i>',
        '/\[u\](.*?)\[\/u\]/is' => '<u>$1</u>',
        '/\[s\](.*?)\[\/s\]/is' => '<s>$1</s>',
        '/\[sup\](.*?)\[\/sup\]/is' => '<sup>$1</sup>',
        '/\[sub\](.*?)\[\/sub\]/is' => '<sub>$1</sub>',
        // headings
        '/\[h1\](.*?)\[\/h1\]/is' => '<h1>$1</h1>',
        '/\[h2\](.*?)\[\/h2\]/is' => '<h2>$1</h2>',
        '/\[h3\](.*?)\[\/h3\]/is' => '<h3>$1</h3>',
        '/\[h4\](.*?)\[\/h4\]/is' => '<h4>$1</h4>',
        '/\[h5\](.*?)\[\/h5\]/is' => '<h5>$1</h5>',
        // formatting tags
        '/\[(?:hr|line)\]/is' => '<hr />',
        '/\[br\/?\]/is' => '<br />',
        // links
        '/\[url=([^\]]+)\](.*?)\[\/url\]/is' => '<a href="$1">$2</a>',
        '/\[link=([^\]]+)\](.*?)\[\/link\]/is' => '<a href="$1">$2</a>',
        '/\[url\](.*?)\[\/url\]/is' => '<a href="$1" title="$1">$1</a>',
        '/\[link\](.*?)\[\/link\]/is' => '<a href="$1" title="$1">$1</a>',
        '/\[img=([^\]]+)\]/is' => '<img src="$1" alt="" />',
        // text blocks and block formats
        '/\[font=([^\]]+)\](.*?)\[\/font\]/is' => '<span style="font-family: $1;">$2</span>',
        '/\[size=([0-9]+)\](.*?)\[\/size\]/is' => '<span style="font-size: $1pt;">$2</span>',
        '/\[color=([^\]]+)\](.*?)\[\/color\]/is' => '<span style="color: $1;">$2</span>',
        '/\[bgcolor=([^\]]+)\](.*?)\[\/bgcolor\]/is' => '<span style="background-color: $1;">$2</span>',
        '/\[p\](.*?)\[\/p\]/is' => '<p>$1</p>',
        // alignment blocks
        '/\[align=(left|center|right|justify)\](.*?)\[\/align\]/is' => '<div style="text-align: $1;">$2</div>',
        '/\[center\](.*?)\[\/center\]/is' => '<div style="text-align: center;">$1</div>',
        '/\[left\](.*?)\[\/left\]/is' => '<div style="text-align: left;">$1</div>',
        '/\[right\](.*?)\[\/right\]/is' => '<div style="text-align: right;">$1</div>',
        '/\[justify\](.*?)\[\/justify\]/is' => '<div style="text-align: justify;">$1</div>',
        // lists
        '/\[list=(disc|circle|square)\](.*?)\[\/list\]/is' => '<ul style="list-style-type:$1;">$2</ul>',
        '/\[list\](.*?)\[\/list\]/is' => '<ul>$1</ul>',
        '/\[list=a\](.*?)\[\/list\]/s' => '<ol style="list-style-type:lower-alpha;">$1</ol>',
        '/\[LIST=a\](.*?)\[\/LIST\]/s' => '<ol style="list-style-type:lower-alpha;">$1</ol>',
        '/\[list=A\](.*?)\[\/list\]/s' => '<ol style="list-style-type:upper-alpha;">$1</ol>',
        '/\[LIST=A\](.*?)\[\/LIST\]/s' => '<ol style="list-style-type:upper-alpha;">$1</ol>',
        '/\[list=1\](.*?)\[\/list\]/is' => '<ol style="list-style-type:decimal;">$1</ol>',
        '/\[list=I\](.*?)\[\/list\]/is' => '<ol style="list-style-type:upper-roman;">$1</ol>',
        '/\[\*\]/is' => '<li>',
        // videos
        '/\[(?:youtube|video|media|movie){1}\](?:https?\:\/\/)?(?:www\.)?(?:youtube\.com\/watch\?v=|youtube\.com\/v\/|youtu\.be\/)?([a-z0-9\-\_]+)\[\/(?:youtube|video|media|movie){1}\]/is'
            => '<iframe width="560" height="315" src="http://www.youtube.com/embed/$1?wmode=opaque" frameborder="0" allowfullscreen></iframe>',
    );
    // do the tags
    $str = preg_replace(array_keys($replace), array_values($replace), $str);
    return $str;
}
This will, however, not work well for nested tags. To process text with nested tags, I use a loop like this:
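$str = "... text to process ...";
// remove unwanted tags
$str = strip_tags($str);
// make entities of special chars; quotes are escaped too, so a value such as
// [url=...] cannot break out of the generated href="..." attribute
$str = htmlentities($str, ENT_QUOTES, 'UTF-8');
// repeat the replacement until the string stops changing, so nested tags are handled
$str_old = "";
do {
    $str_old = $str;
    $str = replaceBBcode($str);
} while ($str_old != $str);
// now $str contains the final HTML tags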
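Since the original question is about XSS, note that the [url=...] rule in replaceBBcode accepts any target, including javascript: links. Below is a minimal sketch of a stricter link rule, assuming the input has already been run through htmlentities as in the loop above. The helper name replaceUrlTags and the http/https-only policy are assumptions for illustration, not part of the parser above.

```php
<?php
// Hypothetical helper: convert [url=...]label[/url] only when the target is http(s).
function replaceUrlTags(string $str): string
{
    $result = preg_replace_callback(
        '/\[url=([^\]]+)\](.*?)\[\/url\]/is',
        function (array $m): string {
            // Accept only absolute http:// or https:// targets that contain
            // no whitespace, quotes, or angle brackets.
            if (!preg_match('/^https?:\/\/[^\s"\'<>]+$/iD', $m[1])) {
                return $m[2]; // reject the link, keep the label as plain text
            }
            return '<a href="' . $m[1] . '">' . $m[2] . '</a>';
        },
        $str
    );

    // preg_replace_callback returns null on a regex error; fall back to the input.
    return $result ?? $str;
}

echo replaceUrlTags('[url=http://example.com/]site[/url]'), "\n";
echo replaceUrlTags('[url=javascript:alert(1)]site[/url]'), "\n";
```

The same kind of check would be needed for the [link], [img], [font] and [color] rules, since their values are also inserted into attributes without validation.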
Upvotes: 1