fent

Reputation: 18205

What precautions should I take to prevent XSS on user submitted HTML?

I'm planning on making a web app that will allow users to post entire web pages on my website. I'm thinking of using HTML Purifier, but I'm not sure, because HTML Purifier edits the HTML and it's important that the HTML is maintained just how it was posted. So I was thinking of making some regex to get rid of all script tags and all the JavaScript attributes like onload, onclick, etc.

I saw a Google video a while ago that had a solution for this. Their solution was to serve the user-posted JavaScript from a different website, so that it cannot access the original website. But I don't wanna purchase a new domain just for this.

Upvotes: 4

Views: 737

Answers (6)

Eli Grey

Reputation: 35820

You should filter ALL HTML and whitelist only the tags and attributes that are safe and semantically useful. WordPress is great at this and I assume that you will find the regular expressions used by WordPress if you search their source code.

Upvotes: 0

bobince

Reputation: 536329

If you can find any other way of letting users post content, that does not involve HTML, do that. There are plenty of user-side light markup systems you can use to generate HTML.

"So I was thinking of making some regex to get rid of all script tags and all the JavaScript attributes like onload, onclick, etc."

Forget it. You cannot process HTML with regex in any useful way. Let alone when security is involved and attackers might be deliberately throwing malformed markup at you.

If you can convince your users to input XHTML, that's much easier to parse. You still can't do it with regex, but you can throw it into a simple XML parser, and walk over the resulting node tree to check that every element and attribute is known-safe, and delete any that aren't, then re-serialise.
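A minimal sketch of that parse, walk, and re-serialise approach, using Python's standard `xml.etree.ElementTree`. The `ALLOWED` table, the URL prefix check, and the function names are assumptions made for illustration, not a vetted policy. It also assumes the markup carries no XML namespace; namespaced tags would simply fail the allowlist check. The standard library documentation warns that this parser is not hardened against maliciously constructed XML, so untrusted input would need a hardened parser in practice.

```python
import xml.etree.ElementTree as ET

# Hypothetical allowlist: tag name -> attribute names permitted on that tag.
ALLOWED = {
    "div": set(),
    "p": set(),
    "b": set(),
    "i": set(),
    "a": {"href"},
}
SAFE_URL_PREFIXES = ("http://", "https://")


def _clean(elem):
    """Strip disallowed attributes from elem, then prune disallowed children."""
    allowed_attrs = ALLOWED[elem.tag]
    for name in list(elem.attrib):
        value = elem.attrib[name]
        if name not in allowed_attrs:
            del elem.attrib[name]
        elif name == "href" and not value.strip().lower().startswith(SAFE_URL_PREFIXES):
            # Drops javascript: and any other non-http(s) link target.
            del elem.attrib[name]
    for child in list(elem):
        if child.tag in ALLOWED:
            _clean(child)
        else:
            # Removes the whole subtree, plus any text that trailed the element.
            elem.remove(child)


def sanitize_xhtml(source):
    """Parse well-formed XHTML, keep only allowlisted nodes, and re-serialise."""
    root = ET.fromstring(source)  # raises ET.ParseError on malformed markup
    if root.tag not in ALLOWED:
        raise ValueError("root element is not allowed")
    _clean(root)
    return ET.tostring(root, encoding="unicode")
```

Because the parser rejects anything that is not well-formed, malformed markup never reaches the filter at all, which is the main advantage over pattern matching on raw text.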

"HTML Purifier edits the HTML and it's important that the HTML is maintained just how it was posted."

Why?

If it's so they can edit it in their original form, then the answer is simply to purify it on the way out to be displayed in the browser, not on the way in at submit-time.

If you must let users input their own free-form HTML — and in general I'd advise against it — then HTML Purifier, with a whitelist approach (ban all elements/attributes that aren't known-safe) is about as good as it gets. It's very very complicated and you may have to keep it up to date when hacks are found, but it's streets ahead of anything you're going to hack up yourself with regexes.

"But I don't wanna purchase a new domain just for this."

You can use a subdomain, as long as any authentication tokens (in particular, cookies) can't cross between subdomains. (Cookies can't by default, because a cookie set without a domain parameter is scoped to the current hostname only.)
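As a rough illustration of the cookie point, here is a sketch that builds a session cookie header with Python's standard `http.cookies` module and deliberately sets no domain parameter. The cookie name `session`, the helper name, and the extra `HttpOnly`/`Secure` flags are assumptions for the example, not something the setup above requires.

```python
from http.cookies import SimpleCookie


def build_session_cookie(token):
    """Return a Set-Cookie header value with no Domain attribute."""
    cookie = SimpleCookie()
    cookie["session"] = token  # hypothetical cookie name
    cookie["session"]["path"] = "/"
    cookie["session"]["httponly"] = True  # not readable from page scripts
    cookie["session"]["secure"] = True    # only sent over HTTPS
    # No cookie["session"]["domain"] assignment: the browser then scopes the
    # cookie to the exact host that set it.
    return cookie["session"].OutputString()
```

Setting the domain parameter to the parent domain would do the opposite and make the cookie visible to every subdomain, including the one serving user content.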

Do you trust your users with scripting capability? If not don't let them have it, or you'll get attack scripts and iframes to Russian exploit/malware sites all over the place...

Upvotes: 3

Noon Silk

Reputation: 55062

The most critical error people make when doing this is validating things on input.

Instead, you should validate on display.

The context matters when determining what is XSS and what isn't. Therefore, you can happily accept any input, as long as you pass it through appropriate cleaning functions when displaying it.

Consider that what constitutes 'XSS' will be different when the input is placed in <a href="HERE"> as opposed to <a>HERE</a>.

Thus, all you need to do is make sure that any time you write user data, you consider, very carefully, where you are displaying it, and make sure that it can't escape the context you are writing it to.
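To make the two contexts concrete, here is a small sketch using only Python's standard library. The function names and the http/https-only rule are assumptions for illustration; a real application would more likely rely on its template engine's auto-escaping.

```python
import html
from urllib.parse import urlsplit


def render_text(user_text):
    """Element-content context: escaping &, < and > is enough."""
    return "<p>" + html.escape(user_text, quote=False) + "</p>"


def render_link(user_url, label):
    """Attribute context: the URL scheme matters as well as the quoting."""
    scheme = urlsplit(user_url.strip()).scheme.lower()
    if scheme not in ("http", "https"):
        # Rejects javascript:, data:, and (in this sketch) relative URLs too.
        user_url = "#"
    return '<a href="{}">{}</a>'.format(
        html.escape(user_url, quote=True),  # also escapes " and '
        html.escape(label, quote=False),
    )
```

The same string `javascript:alert(1)` is harmless inside `render_text` but dangerous as an `href`, which is why the check has to happen at the point of output, where the context is known.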

Upvotes: 4

austin cheney

Reputation:

1) Use clean, simple, directory-based URIs to serve user feed data. When you dynamically create URIs to address the user's uploaded data, service account, or anything else off your domain, make sure you don't pass information as parameters in the URI. That is an extremely easy point of manipulation that could be used to expose flaws in your server security and possibly even inject code onto your server.

2) Patch your server. Ensure you keep your server up to date on all the latest security patches for all the services running on that server.

3) Take all possible server-side protections against SQL injection. If somebody can inject code into your SQL database that can execute from services on your box, that person will own your box. At that point they can install malware onto your webserver to be fed back to your users, or simply record data from the server and send it out to a malicious party.

4) Force all new uploads into a protected sandboxed area to test for script execution. No matter how you try to remove script tags from submitted code there will be a way to circumvent your safeguards to execute script. Browsers are sloppy and do all kinds of stupid crap they are not supposed to do. Test your submissions in a safe area before you publish them for public consumption.

5) Check for beacons in submitted code. This step requires the previous step and can be very complicated, because a beacon can occur in script code that requires a browser plugin to execute, such as ActionScript, but it is just as much a vulnerability as allowing JavaScript to execute from user-submitted code. If a user can submit code that can beacon out to a third party, then your users, and possibly your server, are completely exposed to data loss to a malicious third party.

Upvotes: 0

Charles Ma

Reputation: 49131

Be careful with homebrew regexes for this kind of thing.

A regex like

s/(<.*?)onClick=['"].*?['"](.*?>)/$1 $3/

looks like it might get rid of onclick events, but you can circumvent it with

<a onClick<a onClick="malicious()">="malicious()">

running the regex on that will get you something like

<a onClick ="malicious()">

You can fix it by repeatedly running the regex on that string until it doesn't match, but that's just one example of how easy it is to get around simple regex sanitizers.
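The same failure mode can be reproduced in a few lines of Python. Note that this sketch uses a simpler, hypothetical pattern than the one quoted above (it just deletes any quoted `onclick` attribute), so the payload is different too; the function names are made up for the example.

```python
import re

# Hypothetical single-purpose filter: delete any quoted onclick="..." attribute.
ONCLICK = re.compile(r"""onclick\s*=\s*(['"]).*?\1""", re.IGNORECASE)


def strip_once(markup):
    """One pass of the filter, as a naive sanitizer would apply it."""
    return ONCLICK.sub("", markup)


def strip_until_stable(markup):
    """Re-run the filter until the string stops changing."""
    while True:
        cleaned = strip_once(markup)
        if cleaned == markup:
            return cleaned
        markup = cleaned


# The attacker wraps a decoy attribute inside the letters of the real one.
# Deleting the decoy splices "onc" and "lick" back together.
payload = '<a onconclick="x"lick="malicious()">'
print(strip_once(payload))          # <a onclick="malicious()">
print(strip_until_stable(payload))  # <a >
```

Looping until the output is stable closes this particular hole, but it does nothing about other event attributes, unquoted attribute values, or malformed markup that browsers still accept.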

Upvotes: 5

Tyler Carter

Reputation: 61547

Make sure that user content doesn't contain anything that could cause JavaScript to be run on your page.

You can do this by using an HTML stripping function that gets rid of all HTML tags (like strip_tags from PHP), or by using another similar tool. There are actually many reasons besides XSS to do this. If you have user submitted content, you want to make sure that it doesn't break the site layout.
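For comparison, a rough Python analogue of a tag-stripping function can be built on the standard `html.parser` module. This is a sketch under the assumption that only plain text should survive; the class and function names are made up, and it is not a drop-in equivalent of PHP's strip_tags.

```python
import html
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collects text content and ignores every tag and attribute."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def to_plain_text(markup):
    """Drop all markup, then escape the remaining text for safe display."""
    parser = _TextOnly()
    parser.feed(markup)
    parser.close()
    return html.escape("".join(parser.parts))
```

Escaping the result again matters: the parser decodes entities such as `&lt;` back into `<`, so the extracted text must be re-encoded before it is written into a page.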

I believe you can simply use a subdomain of your current domain to host the JavaScript, and you will get the same security benefits for AJAX. Not for cookies, however.


In your specific case, filtering out the <script> tag and Javascript actions is probably going to be your best bet.

Upvotes: 3
