Reputation: 169
I'm trying to secure HTML coming from external sources, for display on my own web control panel (to load in my browser, read, and delete).
strip_tags
is completely unsafe and useless.
I went through a ton of trouble to make my own DOMDocument
-based HTML securer, removing unsafe elements and attributes. Then I got linked to this nightmare of a webpage: https://owasp.org/www-community/xss-filter-evasion-cheatsheet
That document convinced me that not only is my "clever" HTML securer not enough -- there are far more things that can be done to inject malicious code into HTML than I ever could imagine. That list of things gives me the creeps for real. What a cold shower.
Anyway, looking for a (non-Google-infested) HTML securer for PHP, I found this: http://htmlpurifier.org/
While it seems OK at first glance, some signs point toward sloppiness which is the last thing you want in a security context:
My perception of that project is that it's run by very sloppy and careless people. Can people who make so many mistakes and take so little care to keep their website correct really be trusted to write secure code to purify HTML?
I wish I had never been linked to that page with exploits. I was proud of my own code, and I spent a lot of time on it even though it's not many lines.
This really makes me wonder what everyone else is using (not made by Google). strip_tags
is obviously a complete "no-no", but so is my DOMDocument
code. For example, it checks if any href begins with (case insensitively) "javascript:", but the nightmare page shows that you can inject "invisible" tabs such as "ja vascript:" and add encoded characters and everything to break my code and allow the "javascript:" href after all. And numerous other things which would simply be impossible for me to sit and address in my own code.
Is there really no real_strip_tags
or something built into PHP for this crucial and common task?
Upvotes: 1
Views: 1286
Reputation: 6179
HTML Purifier is a pretty good, established and tested library, although I understand why the lack of clarity as to which repository is the right one really isn't very inspiring. :) It's not as actively worked on as it was in the past, but that's not a bad thing in this case, because it has a whitelist approach. New and exciting HTML that might break your page just isn't known to the whitelist and is stripped out; if you want HTML Purifier to know about these tags and attributes, you need to teach it about how they work before they become a threat.
That said, your DOMDocument
-based code needn't be the wrong approach, but if you do it properly you'll probably end up with HTML Purifier again, which essentially parses the HTML, applies a standards-aware whitelist to the tags, attributes and their values, and then reassembles the HTML.
(Sidenote, since this is more of a question of best practises, you might get better answers on the Software Engineering Stack Exchange site rather than Stackoverflow.)
Upvotes: 0