Reputation: 8512
Jeff posted about this in Sanitize HTML, but his example is in C# and I'm more interested in a Java version. Does anyone have a better version for Java? Is his example good enough to convert directly from C# to Java?
Upvotes: 33
Views: 106744
Reputation: 1
public String validate(String value) {
// Avoid anything between script tags
Pattern scriptPattern = Pattern.compile("<script>(.*?)</script>", Pattern.CASE_INSENSITIVE);
value = scriptPattern.matcher(value).replaceAll("");
// Avoid anything in a src='...' type of expression
scriptPattern = Pattern.compile("src[\r\n]*=[\r\n]*\\\'(.*?)\\\'", Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL);
value = scriptPattern.matcher(value).replaceAll("");
// Avoid anything in a src="..." type of expression
scriptPattern = Pattern.compile("src[\r\n]*=[\r\n]*\\\"(.*?)\\\"", Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL);
value = scriptPattern.matcher(value).replaceAll("");
// Unquoted src=... expressions (quotes are not required): the trailing lazy (.*?) matches the empty string, so only the "src=" prefix is removed
scriptPattern = Pattern.compile("src[\r\n]*=[\r\n]*(.*?)", Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL);
value = scriptPattern.matcher(value).replaceAll("");
// Remove any lonesome </script> tag
scriptPattern = Pattern.compile("</script>", Pattern.CASE_INSENSITIVE);
value = scriptPattern.matcher(value).replaceAll("");
// Remove any lonesome <script ...> tag
scriptPattern = Pattern.compile("<script(.*?)>", Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL);
value = scriptPattern.matcher(value).replaceAll("");
// Avoid eval(...) expressions
scriptPattern = Pattern.compile("eval\\((.*?)\\)", Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL);
value = scriptPattern.matcher(value).replaceAll("");
// Avoid expression(...) expressions
scriptPattern = Pattern.compile("expression\\((.*?)\\)", Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL);
value = scriptPattern.matcher(value).replaceAll("");
// Avoid javascript:... expressions
scriptPattern = Pattern.compile("javascript:", Pattern.CASE_INSENSITIVE);
value = scriptPattern.matcher(value).replaceAll("");
// Avoid vbscript:... expressions
scriptPattern = Pattern.compile("vbscript:", Pattern.CASE_INSENSITIVE);
value = scriptPattern.matcher(value).replaceAll("");
// Avoid onload= expressions
scriptPattern = Pattern.compile("onload(.*?)=", Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL);
value = scriptPattern.matcher(value).replaceAll("");
// Remove anything between < and >, i.e. any remaining tag - paranoid regex. Note: if testing local PREP this must be commented out
scriptPattern = Pattern.compile("<(.*?)[\r\n]*(.*?)>", Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL);
value = scriptPattern.matcher(value).replaceAll("");
// Avoid anything between script tags added - paranoid regex
scriptPattern = Pattern.compile("<script(.*?)[\r\n]*(.*?)/script>", Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL);
value = scriptPattern.matcher(value).replaceAll("");
// Avoid anything between * tags like *(alert)* added
scriptPattern = Pattern.compile("\\*(.*?)[\r\n]*(.*?)\\*", Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL);
value = scriptPattern.matcher(value).replaceAll("");
// Avoid anything between + tags like +(alert)+ added
scriptPattern = Pattern.compile("\\+(.*?)[\r\n]*(.*?)\\+", Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL);
value = scriptPattern.matcher(value).replaceAll("");
// Prohibit lines containing = (...) added
scriptPattern = Pattern.compile("=(.*?)\\((.*?)\\)", Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL);
value = scriptPattern.matcher(value).replaceAll("");
// removing href link
scriptPattern = Pattern.compile("(?i)<[\\s]*[/]?[\\s]*a[^>]*>", Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL);
value = scriptPattern.matcher(value).replaceAll("");
// Avoid alert
scriptPattern = Pattern.compile("alert", Pattern.CASE_INSENSITIVE);
value = scriptPattern.matcher(value).replaceAll("");
// Finally, strip every character that is not an ASCII letter, a digit, or a space
scriptPattern = Pattern.compile("[^\\dA-Za-z ]", Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL);
value = scriptPattern.matcher(value).replaceAll("");
return value;
}
Upvotes: 0
Reputation: 1025
This question perfectly illustrates a great application of the study of Theory of Computation. Theory of Computation is a field that focuses on producing and studying mathematical representations for computation.
Some of the most profound research in computation theory includes the proofs that illustrate the relationships of various languages.
Some of the language relationships that computation theorists have proven include the strict containment of the regular languages within the context-free languages.
This shows that context-free languages are strictly more powerful than regular languages. Thus if a language is explicitly context-free (context-free and not regular), then it is impossible for any regular expression to recognize it.
JavaScript is at the very least context-free, thus we know with one-hundred percent certainty that designing a regular expression (regex) capable of catching all XSS is a mathematically impossible task.
Upvotes: 1
Reputation: 6230
For Java, I used the following regular expression with replaceAll, and it worked for me:
value = value.replaceAll("(?i)(\\b)(on\\S+)(\\s*)=|javascript:|(<\\s*)(\\/*)script|style(\\s*)=|(<\\s*)meta", "");
The (?i) prefix makes the match case-insensitive. Strings are immutable, so the result of replaceAll has to be assigned back to value.
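One caveat with any single-pass removal like this: replaceAll does not rescan its own output, so deleting a blocked token can splice the surrounding characters into a new one. A minimal sketch, assuming the same pattern as above (the class name SinglePassStripDemo, the method names, and the repeat-until-stable loop are illustrative and not part of the original answer):

```java
import java.util.regex.Pattern;

class SinglePassStripDemo {
    // Same expression as in the answer above, compiled once.
    private static final Pattern BLOCKED = Pattern.compile(
        "(?i)(\\b)(on\\S+)(\\s*)=|javascript:|(<\\s*)(\\/*)script|style(\\s*)=|(<\\s*)meta");

    // One pass, equivalent to value.replaceAll(...).
    static String stripOnce(String value) {
        return BLOCKED.matcher(value).replaceAll("");
    }

    // Repeat until nothing changes, so tokens spliced together by a removal are caught too.
    // Every pass that changes the string shortens it, so the loop terminates.
    static String stripUntilStable(String value) {
        String previous;
        do {
            previous = value;
            value = stripOnce(value);
        } while (!value.equals(previous));
        return value;
    }

    public static void main(String[] args) {
        String input = "<scr<scriptipt>";
        System.out.println(stripOnce(input));        // the inner "<script" is removed, leaving "<script>"
        System.out.println(stripUntilStable(input)); // a second pass removes the spliced "<script" as well
    }
}
```

Looping helps with this particular trick, but the filter is still a blacklist: it does nothing for vectors the pattern never mentions.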
Upvotes: 0
Reputation: 6121
An old thread, but maybe this will be useful for other users. There is a maintained security layer tool for PHP: https://github.com/PHPIDS/ It is based on a set of regexes, which you can find here:
https://github.com/PHPIDS/PHPIDS/blob/master/lib/IDS/default_filter.xml
Upvotes: 0
Reputation: 79
I extracted this regex from NoScript, one of the best anti-XSS add-ons. It works flawlessly for me:
<[^\w<>]*(?:[^<>"'\s]*:)?[^\w<>]*(?:\W*s\W*c\W*r\W*i\W*p\W*t|\W*f\W*o\W*r\W*m|\W*s\W*t\W*y\W*l\W*e|\W*s\W*v\W*g|\W*m\W*a\W*r\W*q\W*u\W*e\W*e|(?:\W*l\W*i\W*n\W*k|\W*o\W*b\W*j\W*e\W*c\W*t|\W*e\W*m\W*b\W*e\W*d|\W*a\W*p\W*p\W*l\W*e\W*t|\W*p\W*a\W*r\W*a\W*m|\W*i?\W*f\W*r\W*a\W*m\W*e|\W*b\W*a\W*s\W*e|\W*b\W*o\W*d\W*y|\W*m\W*e\W*t\W*a|\W*i\W*m\W*a?\W*g\W*e?|\W*v\W*i\W*d\W*e\W*o|\W*a\W*u\W*d\W*i\W*o|\W*b\W*i\W*n\W*d\W*i\W*n\W*g\W*s|\W*s\W*e\W*t|\W*i\W*s\W*i\W*n\W*d\W*e\W*x|\W*a\W*n\W*i\W*m\W*a\W*t\W*e)[^>\w])|(?:<\w[\s\S]*[\s\0\/]|['"])(?:formaction|style|background|src|lowsrc|ping|on(?:d(?:e(?:vice(?:(?:orienta|mo)tion|proximity|found|light)|livery(?:success|error)|activate)|r(?:ag(?:e(?:n(?:ter|d)|xit)|(?:gestur|leav)e|start|drop|over)?|op)|i(?:s(?:c(?:hargingtimechange|onnect(?:ing|ed))|abled)|aling)|ata(?:setc(?:omplete|hanged)|(?:availabl|chang)e|error)|urationchange|ownloading|blclick)|Moz(?:M(?:agnifyGesture(?:Update|Start)?|ouse(?:PixelScroll|Hittest))|S(?:wipeGesture(?:Update|Start|End)?|crolledAreaChanged)|(?:(?:Press)?TapGestur|BeforeResiz)e|EdgeUI(?:C(?:omplet|ancel)|Start)ed|RotateGesture(?:Update|Start)?|A(?:udioAvailable|fterPaint))|c(?:o(?:m(?:p(?:osition(?:update|start|end)|lete)|mand(?:update)?)|n(?:t(?:rolselect|extmenu)|nect(?:ing|ed))|py)|a(?:(?:llschang|ch)ed|nplay(?:through)?|rdstatechange)|h(?:(?:arging(?:time)?ch)?ange|ecking)|(?:fstate|ell)change|u(?:echange|t)|l(?:ick|ose))|m(?:o(?:z(?:pointerlock(?:change|error)|(?:orientation|time)change|fullscreen(?:change|error)|network(?:down|up)load)|use(?:(?:lea|mo)ve|o(?:ver|ut)|enter|wheel|down|up)|ve(?:start|end)?)|essage|ark)|s(?:t(?:a(?:t(?:uschanged|echange)|lled|rt)|k(?:sessione|comma)nd|op)|e(?:ek(?:complete|ing|ed)|(?:lec(?:tstar)?)?t|n(?:ding|t))|u(?:ccess|spend|bmit)|peech(?:start|end)|ound(?:start|end)|croll|how)|b(?:e(?:for(?:e(?:(?:scriptexecu|activa)te|u(?:nload|pdate)|p(?:aste|rint)|c(?:opy|ut)|editfocus)|deactivate)|gin(?:Event)?)|oun(?:dary|ce)|l(?:ocked|ur)|roadcast|usy)|a(?:n(?:imation(?
:iteration|start|end)|tennastatechange)|fter(?:(?:scriptexecu|upda)te|print)|udio(?:process|start|end)|d(?:apteradded|dtrack)|ctivate|lerting|bort)|DOM(?:Node(?:Inserted(?:IntoDocument)?|Removed(?:FromDocument)?)|(?:CharacterData|Subtree)Modified|A(?:ttrModified|ctivate)|Focus(?:Out|In)|MouseScroll)|r(?:e(?:s(?:u(?:m(?:ing|e)|lt)|ize|et)|adystatechange|pea(?:tEven)?t|movetrack|trieving|ceived)|ow(?:s(?:inserted|delete)|e(?:nter|xit))|atechange)|p(?:op(?:up(?:hid(?:den|ing)|show(?:ing|n))|state)|a(?:ge(?:hide|show)|(?:st|us)e|int)|ro(?:pertychange|gress)|lay(?:ing)?)|t(?:ouch(?:(?:lea|mo)ve|en(?:ter|d)|cancel|start)|ime(?:update|out)|ransitionend|ext)|u(?:s(?:erproximity|sdreceived)|p(?:gradeneeded|dateready)|n(?:derflow|load))|f(?:o(?:rm(?:change|input)|cus(?:out|in)?)|i(?:lterchange|nish)|ailed)|l(?:o(?:ad(?:e(?:d(?:meta)?data|nd)|start)?|secapture)|evelchange|y)|g(?:amepad(?:(?:dis)?connected|button(?:down|up)|axismove)|et)|e(?:n(?:d(?:Event|ed)?|abled|ter)|rror(?:update)?|mptied|xit)|i(?:cc(?:cardlockerror|infochange)|n(?:coming|valid|put))|o(?:(?:(?:ff|n)lin|bsolet)e|verflow(?:changed)?|pen)|SVG(?:(?:Unl|L)oad|Resize|Scroll|Abort|Error|Zoom)|h(?:e(?:adphoneschange|l[dp])|ashchange|olding)|v(?:o(?:lum|ic)e|ersion)change|w(?:a(?:it|rn)ing|heel)|key(?:press|down|up)|(?:AppComman|Loa)d|no(?:update|match)|Request|zoom))[\s\0]*=
Test: http://regex101.com/r/rV7zK8
I think it blocks 99% of XSS, because it is part of NoScript, an add-on that gets updated regularly.
Upvotes: 7
Reputation:
^(\s|\w|\d|<br>)*?$
This will validate word characters, digits, whitespace, and also the <br> tag.
If you are willing to accept more risk, you can allow more tags, like
^(\s|\w|\d|<br>|<ul>|</ul>)*?$
Upvotes: 3
Reputation: 25824
[\s\w\.]*
If the input doesn't match, you've got XSS. Maybe. Take note that this expression only allows letters, digits, underscores, whitespace, and periods. It avoids all other symbols, even useful ones, out of fear of XSS. Once you allow &, you've got worries. And merely replacing all instances of & with &amp; is not sufficient. Too complicated to trust :P. Obviously this will disallow a lot of legitimate text (you can just replace all nonmatching characters with a ! or something), but I think it will kill XSS.
The idea to just parse it as HTML and generate new HTML is probably better.
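As a rough Java sketch of the whitelist check and the "replace nonmatching characters with a !" idea described above (the class and method names are made up for illustration):

```java
import java.util.regex.Pattern;

class WhitelistCheck {
    // The whitelist from the answer: whitespace, word characters, and periods only.
    private static final Pattern ALLOWED = Pattern.compile("[\\s\\w.]*");
    // Its complement: any single character outside the whitelist.
    private static final Pattern DISALLOWED = Pattern.compile("[^\\s\\w.]");

    // True only if the whole input consists of whitelisted characters.
    static boolean isClean(String input) {
        return ALLOWED.matcher(input).matches();
    }

    // Replace every non-whitelisted character with '!'.
    static String scrub(String input) {
        return DISALLOWED.matcher(input).replaceAll("!");
    }

    public static void main(String[] args) {
        System.out.println(isClean("Hello world. 123"));
        System.out.println(isClean("<b>hi</b>"));
        System.out.println(scrub("<b>hi</b>"));
    }
}
```

Note that the check has to use matches(), which must cover the whole input. find() would succeed on any string, because the pattern also matches the empty string.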
Upvotes: 0
Reputation: 15851
Don't do this with regular expressions. Remember, you're not protecting just against valid HTML; you're protecting against the DOM that web browsers create. Browsers can be tricked into producing valid DOM from invalid HTML quite easily.
For example, see this list of obfuscated XSS attacks. Are you prepared to tailor a regex to prevent this real world attack on Yahoo and Hotmail on IE6/7/8?
<HTML><BODY>
<?xml:namespace prefix="t" ns="urn:schemas-microsoft-com:time">
<?import namespace="t" implementation="#default#time2">
<t:set attributeName="innerHTML" to="XSS&lt;SCRIPT DEFER&gt;alert(&quot;XSS&quot;)&lt;/SCRIPT&gt;">
</BODY></HTML>
How about this attack that works on IE6?
<TABLE BACKGROUND="javascript:alert('XSS')">
How about attacks that are not listed on this site? The problem with Jeff's approach is that it's not a whitelist, as claimed. As someone on that page aptly notes:
The problem with it, is that the html must be clean. There are cases where you can pass in hacked html, and it won't match it, in which case it'll return the hacked html string as it won't match anything to replace. This isn't strictly whitelisting.
I would suggest a purpose-built tool like AntiSamy. It works by actually parsing the HTML, and then traversing the DOM and removing anything that's not in the configurable whitelist. The major difference is the ability to gracefully handle malformed HTML.
The best part is that it actually unit tests for all the XSS attacks on the above site. Besides, what could be easier than this API call:
public String toSafeHtml(String html) throws ScanException, PolicyException {
Policy policy = Policy.getInstance(POLICY_FILE);
AntiSamy antiSamy = new AntiSamy();
CleanResults cleanResults = antiSamy.scan(html, policy);
return cleanResults.getCleanHTML().trim();
}
Upvotes: 79
Reputation: 1594
I'm not too convinced that using a regular expression is the best way of finding all suspect code. Regular expressions are quite easy to trick, especially when dealing with broken HTML. For example, the regular expression listed in the Sanitize HTML link will fail to remove all 'a' elements that have an attribute between the element name and the attribute 'href':
< a alt="xss injection" href="http://www.malicous.com/bad.php" >
A more robust way of removing malicious code is to rely on an XML parser that can handle all kinds of HTML documents (Tidy, TagSoup, etc.) and to select the elements to remove with an XPath expression. Once the HTML document is parsed into a DOM document, the elements to remove can be found easily and safely. This is even easy to do with XSLT.
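A minimal sketch of the parse-then-select approach, using only the JDK's built-in XML and XPath APIs. It assumes the input has already been tidied into well-formed XHTML (that step, e.g. Tidy or TagSoup, is not shown), and the class name, method name, and sample XPath expression are illustrative:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathRemovalSketch {
    // Parses well-formed XHTML and removes every node selected by the XPath expression.
    static Document removeMatching(String xhtml, String xpath) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Refuse DOCTYPE declarations so external entities cannot be pulled in.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            NodeList hits = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(xpath, doc, XPathConstants.NODESET);
            for (int i = 0; i < hits.getLength(); i++) {
                Node node = hits.item(i);
                // Attribute nodes have no parent node; only detach nodes that do.
                if (node.getParentNode() != null) {
                    node.getParentNode().removeChild(node);
                }
            }
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse or filter input", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<div><a alt=\"xss injection\" href=\"http://www.malicous.com/bad.php\">x</a><p>ok</p></div>";
        Document cleaned = removeMatching(xhtml, "//a");
        System.out.println(cleaned.getElementsByTagName("a").getLength());
    }
}
```

Because the selection runs on the parsed tree, the extra alt attribute in front of href makes no difference to whether the element is found.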
Upvotes: 4
Reputation: 7110
The biggest problem with using Jeff's code is the @ (the C# verbatim string prefix), which Java doesn't have.
I would probably just take the "raw" regexp from Jeff's code if I needed it and paste it into
http://www.cis.upenn.edu/~matuszek/General/RegexTester/regex-tester.html
to see which characters need escaping, and then use the escaped version.
Given what this regex is used for, I would personally make sure I understood exactly what I was doing, why, and what the consequences would be if I didn't succeed, before copy/pasting anything, as the other answers try to help you with.
(That's probably pretty sound advice for any copy/paste.)
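To illustrate the escaping step: in a C# verbatim string a backslash is written once, while in a Java string literal every backslash in the regex has to be doubled. A small sketch with a made-up pattern (it is not taken from Jeff's code, and the class and method names are hypothetical):

```java
import java.util.regex.Pattern;

class EscapedRegexDemo {
    // Raw regex (hypothetical example): \bon\w+\s*=
    // In a Java string literal each backslash is doubled.
    static final Pattern EVENT_HANDLER =
            Pattern.compile("\\bon\\w+\\s*=", Pattern.CASE_INSENSITIVE);

    // True if the text contains something shaped like an inline event handler attribute.
    static boolean hasEventHandler(String html) {
        return EVENT_HANDLER.matcher(html).find();
    }

    public static void main(String[] args) {
        System.out.println(hasEventHandler("<img src=x ONERROR = alert(1)>"));
        System.out.println(hasEventHandler("<p>only text</p>"));
    }
}
```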
Upvotes: 0