Reputation: 8512

Best regex to catch XSS (Cross-site Scripting) attack (in Java)?

Jeff actually posted about this in Sanitize HTML. But his example is in C# and I'm actually more interested in a Java version. Does anyone have a better version for Java? Is his example good enough to just convert directly from C# to Java?

Upvotes: 33

Answers (10)

Ashhh

Reputation: 1

public String validate(String value) {


    // Avoid anything between script tags
    Pattern scriptPattern = Pattern.compile("<script>(.*?)</script>", Pattern.CASE_INSENSITIVE);
    value = scriptPattern.matcher(value).replaceAll("");

    // Avoid anything in a src='...' type of expression
    scriptPattern = Pattern.compile("src[\r\n]*=[\r\n]*\\\'(.*?)\\\'", Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL);
    value = scriptPattern.matcher(value).replaceAll("");

    // Avoid anything in a src="..." type of expression
    scriptPattern = Pattern.compile("src[\r\n]*=[\r\n]*\\\"(.*?)\\\"", Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL);
    value = scriptPattern.matcher(value).replaceAll("");

    // Avoid anything in a src=... type of expression added because quotes are not necessary
    scriptPattern = Pattern.compile("src[\r\n]*=[\r\n]*(.*?)", Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL);
    value = scriptPattern.matcher(value).replaceAll("");

    // Remove any lonesome </script> tag
    scriptPattern = Pattern.compile("</script>", Pattern.CASE_INSENSITIVE);
    value = scriptPattern.matcher(value).replaceAll("");

    // Remove any lonesome <script ...> tag
    scriptPattern = Pattern.compile("<script(.*?)>", Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL);
    value = scriptPattern.matcher(value).replaceAll("");

    // Avoid eval(...) expressions
    scriptPattern = Pattern.compile("eval\\((.*?)\\)", Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL);
    value = scriptPattern.matcher(value).replaceAll("");

    // Avoid expression(...) expressions
    scriptPattern = Pattern.compile("expression\\((.*?)\\)", Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL);
    value = scriptPattern.matcher(value).replaceAll("");

    // Avoid javascript:... expressions
    scriptPattern = Pattern.compile("javascript:", Pattern.CASE_INSENSITIVE);
    value = scriptPattern.matcher(value).replaceAll("");

    // Avoid vbscript:... expressions
    scriptPattern = Pattern.compile("vbscript:", Pattern.CASE_INSENSITIVE);
    value = scriptPattern.matcher(value).replaceAll("");

    // Avoid onload= expressions
    scriptPattern = Pattern.compile("onload(.*?)=", Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL);
    value = scriptPattern.matcher(value).replaceAll("");

    // Avoid anything between script tags  added - paranoid regex.  note:  if testing local PREP this must be commented
    scriptPattern = Pattern.compile("<(.*?)[\r\n]*(.*?)>", Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL);
    value = scriptPattern.matcher(value).replaceAll("");

    //  Avoid anything between script tags  added - paranoid regex
    scriptPattern = Pattern.compile("<script(.*?)[\r\n]*(.*?)/script>", Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL);
    value = scriptPattern.matcher(value).replaceAll("");

    // Avoid anything between * tags like *(alert)*  added
    scriptPattern = Pattern.compile("\\*(.*?)[\r\n]*(.*?)\\*", Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL);
    value = scriptPattern.matcher(value).replaceAll("");
    
    // Avoid anything between + tags like +(alert)+  added
    scriptPattern = Pattern.compile("\\+(.*?)[\r\n]*(.*?)\\+", Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL);
    value = scriptPattern.matcher(value).replaceAll("");

    // Prohibit lines containing = (...) added
    scriptPattern = Pattern.compile("=(.*?)\\((.*?)\\)", Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL);
    value = scriptPattern.matcher(value).replaceAll("");
    
    //  removing href link 
    scriptPattern = Pattern.compile("(?i)<[\\s]*[/]?[\\s]*a[^>]*>", Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL);
    value = scriptPattern.matcher(value).replaceAll("");
    //Avoid alert
    scriptPattern = Pattern.compile("alert", Pattern.CASE_INSENSITIVE);
    value = scriptPattern.matcher(value).replaceAll("");


    scriptPattern = Pattern.compile("[^\\dA-Za-z ]", Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL);
    value = scriptPattern.matcher(value).replaceAll("");
    
    
    return value;
            
}

Upvotes: 0

Philip DiSarro

Reputation: 1025

This question perfectly illustrates a great application of the study of Theory of Computation. Theory of Computation is a field that focuses on producing and studying mathematical representations for computation.

Some of the most profound research in computation theory includes the proofs that illustrate the relationships of various languages.

Some of the language relationships that computation theorists have proven include:

This shows that context free languages are strictly more powerful than regular languages. Thus if a language is explicitly context-free (context-free and not regular), then it is impossible for any regular expression to recognize it.

JavaScript is at the very least context-free, thus we know with one-hundred percent certainty that designing a regular expression (regex) capable of catching all XSS is a mathematically impossible task.

Upvotes: 1

NoNaMe

Reputation: 6230

For java, I used the following regular expression with replaceAll, and worked for me

value.replaceAll("(?i)(\\b)(on\\S+)(\\s*)=|javascript:|(<\\s*)(\\/*)script|style(\\s*)=|(<\\s*)meta", "");

Added (?i) to ignore case for alphabets.

Upvotes: 0

KIC

Reputation: 6121

An old thread but maybe this will be useful for other users. There is a maintained security layer tool for php: https://github.com/PHPIDS/ It is based on a set of regex which you can find here:

https://github.com/PHPIDS/PHPIDS/blob/master/lib/IDS/default_filter.xml

Upvotes: 0

user3709489

Reputation: 79

I extracted from NoScript best Anti-XSS addon, here is its Regex: Work flawless:

<[^\w<>]*(?:[^<>"'\s]*:)?[^\w<>]*(?:\W*s\W*c\W*r\W*i\W*p\W*t|\W*f\W*o\W*r\W*m|\W*s\W*t\W*y\W*l\W*e|\W*s\W*v\W*g|\W*m\W*a\W*r\W*q\W*u\W*e\W*e|(?:\W*l\W*i\W*n\W*k|\W*o\W*b\W*j\W*e\W*c\W*t|\W*e\W*m\W*b\W*e\W*d|\W*a\W*p\W*p\W*l\W*e\W*t|\W*p\W*a\W*r\W*a\W*m|\W*i?\W*f\W*r\W*a\W*m\W*e|\W*b\W*a\W*s\W*e|\W*b\W*o\W*d\W*y|\W*m\W*e\W*t\W*a|\W*i\W*m\W*a?\W*g\W*e?|\W*v\W*i\W*d\W*e\W*o|\W*a\W*u\W*d\W*i\W*o|\W*b\W*i\W*n\W*d\W*i\W*n\W*g\W*s|\W*s\W*e\W*t|\W*i\W*s\W*i\W*n\W*d\W*e\W*x|\W*a\W*n\W*i\W*m\W*a\W*t\W*e)[^>\w])|(?:<\w[\s\S]*[\s\0\/]|['"])(?:formaction|style|background|src|lowsrc|ping|on(?:d(?:e(?:vice(?:(?:orienta|mo)tion|proximity|found|light)|livery(?:success|error)|activate)|r(?:ag(?:e(?:n(?:ter|d)|xit)|(?:gestur|leav)e|start|drop|over)?|op)|i(?:s(?:c(?:hargingtimechange|onnect(?:ing|ed))|abled)|aling)|ata(?:setc(?:omplete|hanged)|(?:availabl|chang)e|error)|urationchange|ownloading|blclick)|Moz(?:M(?:agnifyGesture(?:Update|Start)?|ouse(?:PixelScroll|Hittest))|S(?:wipeGesture(?:Update|Start|End)?|crolledAreaChanged)|(?:(?:Press)?TapGestur|BeforeResiz)e|EdgeUI(?:C(?:omplet|ancel)|Start)ed|RotateGesture(?:Update|Start)?|A(?:udioAvailable|fterPaint))|c(?:o(?:m(?:p(?:osition(?:update|start|end)|lete)|mand(?:update)?)|n(?:t(?:rolselect|extmenu)|nect(?:ing|ed))|py)|a(?:(?:llschang|ch)ed|nplay(?:through)?|rdstatechange)|h(?:(?:arging(?:time)?ch)?ange|ecking)|(?:fstate|ell)change|u(?:echange|t)|l(?:ick|ose))|m(?:o(?:z(?:pointerlock(?:change|error)|(?:orientation|time)change|fullscreen(?:change|error)|network(?:down|up)load)|use(?:(?:lea|mo)ve|o(?:ver|ut)|enter|wheel|down|up)|ve(?:start|end)?)|essage|ark)|s(?:t(?:a(?:t(?:uschanged|echange)|lled|rt)|k(?:sessione|comma)nd|op)|e(?:ek(?:complete|ing|ed)|(?:lec(?:tstar)?)?t|n(?:ding|t))|u(?:ccess|spend|bmit)|peech(?:start|end)|ound(?:start|end)|croll|how)|b(?:e(?:for(?:e(?:(?:scriptexecu|activa)te|u(?:nload|pdate)|p(?:aste|rint)|c(?:opy|ut)|editfocus)|deactivate)|gin(?:Event)?)|oun(?:dary|ce)|l(?:ocked|ur)|roadcast|usy)|a(?:n(?:imation(?:iteration|start|end)|tennastatechange)|fter(?:(?:scriptexecu|upda)te|print)|udio(?:process|start|end)|d(?:apteradded|dtrack)|ctivate|lerting|bort)|DOM(?:Node(?:Inserted(?:IntoDocument)?|Removed(?:FromDocument)?)|(?:CharacterData|Subtree)Modified|A(?:ttrModified|ctivate)|Focus(?:Out|In)|MouseScroll)|r(?:e(?:s(?:u(?:m(?:ing|e)|lt)|ize|et)|adystatechange|pea(?:tEven)?t|movetrack|trieving|ceived)|ow(?:s(?:inserted|delete)|e(?:nter|xit))|atechange)|p(?:op(?:up(?:hid(?:den|ing)|show(?:ing|n))|state)|a(?:ge(?:hide|show)|(?:st|us)e|int)|ro(?:pertychange|gress)|lay(?:ing)?)|t(?:ouch(?:(?:lea|mo)ve|en(?:ter|d)|cancel|start)|ime(?:update|out)|ransitionend|ext)|u(?:s(?:erproximity|sdreceived)|p(?:gradeneeded|dateready)|n(?:derflow|load))|f(?:o(?:rm(?:change|input)|cus(?:out|in)?)|i(?:lterchange|nish)|ailed)|l(?:o(?:ad(?:e(?:d(?:meta)?data|nd)|start)?|secapture)|evelchange|y)|g(?:amepad(?:(?:dis)?connected|button(?:down|up)|axismove)|et)|e(?:n(?:d(?:Event|ed)?|abled|ter)|rror(?:update)?|mptied|xit)|i(?:cc(?:cardlockerror|infochange)|n(?:coming|valid|put))|o(?:(?:(?:ff|n)lin|bsolet)e|verflow(?:changed)?|pen)|SVG(?:(?:Unl|L)oad|Resize|Scroll|Abort|Error|Zoom)|h(?:e(?:adphoneschange|l[dp])|ashchange|olding)|v(?:o(?:lum|ic)e|ersion)change|w(?:a(?:it|rn)ing|heel)|key(?:press|down|up)|(?:AppComman|Loa)d|no(?:update|match)|Request|zoom))[\s\0]*=

Test: http://regex101.com/r/rV7zK8

I think it block 99% XSS because it is a part of NoScript, a addon that get updated regularly

Upvotes: 7

atoms

Reputation:

^(\s|\w|\d|<br>)*?$

This will validate characters, digits, whitespaces and also the <br> tag. If you want more risk you can add more tags like

^(\s|\w|\d|<br>|<ul>|<\ul>)*?$

Upvotes: 3

Brian

Reputation: 25824

[\s\w\.]*. If it doesn't match, you've got XSS. Maybe. Take note that this expression only allows letters, numbers, and periods. It avoids all symbols, even useful ones, out of fear of XSS. Once you allow &, you've got worries. And merely replacing all instances of & with & is not sufficient. Too complicated to trust :P. Obviously this will disallow a lot of legitimate text (You can just replace all nonmatching characters with a ! or something), but I think it will kill XSS.

The idea to just parse it as html and generate new html is probably better.

Upvotes: 0

Chase Seibert

Reputation: 15851

Don't do this with regular expressions. Remember, you're not protecting just against valid HTML; you're protecting against the DOM that web browsers create. Browsers can be tricked into producing valid DOM from invalid HTML quite easily.

For example, see this list of obfuscated XSS attacks. Are you prepared to tailor a regex to prevent this real world attack on Yahoo and Hotmail on IE6/7/8?

<HTML><BODY>
<?xml:namespace prefix="t" ns="urn:schemas-microsoft-com:time">
<?import namespace="t" implementation="#default#time2">
<t:set attributeName="innerHTML" to="XSS&lt;SCRIPT DEFER&gt;alert(&quot;XSS&quot;)&lt;/SCRIPT&gt;">
</BODY></HTML>

How about this attack that works on IE6?

<TABLE BACKGROUND="javascript:alert('XSS')">

How about attacks that are not listed on this site? The problem with Jeff's approach is that it's not a whitelist, as claimed. As someone on that page adeptly notes:

The problem with it, is that the html must be clean. There are cases where you can pass in hacked html, and it won't match it, in which case it'll return the hacked html string as it won't match anything to replace. This isn't strictly whitelisting.

I would suggest a purpose built tool like AntiSamy. It works by actually parsing the HTML, and then traversing the DOM and removing anything that's not in the configurable whitelist. The major difference is the ability to gracefully handle malformed HTML.

The best part is that it actually unit tests for all the XSS attacks on the above site. Besides, what could be easier than this API call:

public String toSafeHtml(String html) throws ScanException, PolicyException {

    Policy policy = Policy.getInstance(POLICY_FILE);
    AntiSamy antiSamy = new AntiSamy();
    CleanResults cleanResults = antiSamy.scan(html, policy);
    return cleanResults.getCleanHTML().trim();
}

Upvotes: 79

Emmanuel Rodriguez

Reputation: 1594

I'm not to convinced that using a regular expression is the best way for finding all suspect code. Regular expressions are quite easy to trick specially when dealing with broken HTML. For example, the regular expression listed in the Sanitize HTML link will fail to remove all 'a' elements that have an attribute between the element name and the attribute 'href':

< a alt="xss injection" href="http://www.malicous.com/bad.php" >

A more robust way of removing malicious code is to rely on a XML Parser that can handle all kind of HTML documents (Tidy, TagSoup, etc) and to select the elements to remove with an XPath expression. Once the HTML document is parsed into a DOM document the elements to revome can be found easily and safely. This is even easy to do with XSLT.

Upvotes: 4

svrist

Reputation: 7110

The biggest problem by using jeffs code is the @ which currently isnt available.

I would probably just take the "raw" regexp from jeffs code if i needed it and paste it into

http://www.cis.upenn.edu/~matuszek/General/RegexTester/regex-tester.html

and see the things needing escape get escaped and then use it.

Taking the usage of this regex in mind I would personally make sure I understood exactly what I was doing, why and what consequences would be if I didnt succeed, before copy/pasting anything, like the other answers try to help you with.

(Thats propbably pretty sound advice for any copy/paste)

Upvotes: 0

Best regex to catch XSS (Cross-site Scripting) attack (in Java)?

Answers (10)

Related Questions