Reputation: 3240
I've spent quite a bit of time trying to figure out the best way to handle this. I'm HTML encoding rich text from untrusted user input prior to storing it in the database.
I've bounced back and forth between multiple discussions, and it seems the safest method is to:
HTML encode absolutely everything, and only decode based on a white/safe list prior to sending it back to the client.
However, I'm also seeing strong suggestions for using the HTML Agility Pack: http://htmlagilitypack.codeplex.com/
With it, you parse the user input and compare the resulting tags against your safe/white list.
I've read:
C# HtmlDecode Specific tags only
https://eksith.wordpress.com/2011/06/14/whitelist-santize-htmlagilitypack/
And really, about 10 other posts, and I've become frustrated because I still can't figure out the best way to handle this.
I've tried building a regular expression per tag and calling Regex.Replace with it:
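For Each tag In AcceptableTags.Split("|"c)
    ' Match the opening or closing form of this tag, including any attributes.
    ' \b keeps "b" from also matching "br" or "body".
    Dim pattern As String = "<\s*/?\s*" & Regex.Escape(tag) & "\b.*?>"
    ' replacement is whatever each matched tag should become (declared elsewhere).
    input = Regex.Replace(input, pattern, replacement, RegexOptions.IgnoreCase)
Next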
This doesn't seem to work well at all.
Is there someone out there who has a tried and true method with an example implementation they wouldn't mind sharing? I'll take C# or VB.NET.
Upvotes: 1
Views: 270
Reputation: 5697
It depends on your data. Whitelisting during the initial validation is fine if, for example, you're trying to keep HTML out of a phone number. On the other hand, if you can't be specific about what's allowed and what isn't, then just store it "raw".
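For the phone number case, input validation against an allow-list can be as small as the sketch below. The class name and the accepted format (optional leading +, then 7 to 20 digits, spaces, parentheses, or hyphens) are assumptions for illustration, not a real-world phone number standard.

```csharp
using System;
using System.Text.RegularExpressions;

static class PhoneValidator
{
    // Hypothetical format: optional leading +, then 7-20 digits, spaces,
    // parentheses, or hyphens. \A and \z anchor the whole string, so a
    // trailing newline is rejected as well.
    static readonly Regex Allowed = new Regex(@"\A\+?[0-9 ()\-]{7,20}\z");

    public static bool IsValid(string input) =>
        input != null && Allowed.IsMatch(input);
}

class Program
{
    static void Main()
    {
        Console.WriteLine(PhoneValidator.IsValid("+1 (555) 123-4567"));
        Console.WriteLine(PhoneValidator.IsValid("<script>alert(1)</script>"));
    }
}
```

Anything that fails the check is rejected outright instead of being "cleaned up", which is what makes this kind of whitelist safe for narrowly defined fields.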
It's highly unlikely that storing encoded data in a database is the correct thing to do.
Any system of even marginal complexity will have non-HTML clients it has to serve data to. When you do have an HTML client, you need to escape the output appropriately for HTML. Same for XML. Similarly, if you decide today you like JSON better, you'll encode to that. CSV? No problem - put quotes around your values (and escape any quotes) in case they have commas. Use parameters when doing SQL. Get the idea?
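To make that concrete, here is a minimal sketch of one raw stored value being encoded differently per output. The class and method names are made up for illustration; the calls underneath are the framework's own WebUtility.HtmlEncode and System.Text.Json's JsonSerializer.Serialize, plus the quote-doubling rule for CSV described above.

```csharp
using System;
using System.Net;
using System.Text.Json;

static class OutputEncoding
{
    // HTML client: encode when rendering, not when storing.
    public static string ForHtml(string raw) => WebUtility.HtmlEncode(raw);

    // JSON client: let the serializer produce the escaped string literal.
    public static string ForJson(string raw) => JsonSerializer.Serialize(raw);

    // CSV client: wrap the field in quotes and double any embedded quotes.
    public static string ForCsv(string raw) =>
        "\"" + raw.Replace("\"", "\"\"") + "\"";
}

class Program
{
    static void Main()
    {
        // Stored in the database exactly as the user typed it.
        string raw = "Tom said \"hi\", <b>bye</b>";

        Console.WriteLine(OutputEncoding.ForHtml(raw));
        Console.WriteLine(OutputEncoding.ForJson(raw));
        Console.WriteLine(OutputEncoding.ForCsv(raw));
    }
}
```

The stored value never changes; only the last step before each client does. Had the value been saved HTML-encoded, the JSON and CSV paths would first have to undo that encoding.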
TL;DR:
Saving data in one specific encoding is probably wrong.
Always, always, always escape appropriately for your output.
Never try to do your own escaping; always use a trusted library. You will never do a good enough job.
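For the rich text case in the question, where some tags have to survive, the usual alternative to regular expressions is to let a real parser build the tree and then filter it against the allow-list. Below is a rough sketch, assuming the HtmlAgilityPack NuGet package mentioned in the question is installed. The class name and the tag list are hypothetical, and this is an illustration of the idea, not a hardened sanitizer; a maintained sanitizing library is still the better choice for production.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // third-party NuGet package (assumed installed)

static class RichTextFilter
{
    // Hypothetical allow-list; adjust it to what your editor actually emits.
    static readonly HashSet<string> AllowedTags =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        { "b", "i", "em", "strong", "p", "br", "ul", "ol", "li" };

    public static string Sanitize(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Snapshot the tree first, because nodes are removed while walking it.
        foreach (var node in doc.DocumentNode.Descendants().ToList())
        {
            if (node.NodeType == HtmlNodeType.Comment)
            {
                node.Remove();
            }
            else if (node.NodeType == HtmlNodeType.Element)
            {
                if (AllowedTags.Contains(node.Name))
                    node.Attributes.RemoveAll(); // drops onclick, style, href, ...
                else
                    node.Remove();               // drops the element and its content
            }
        }
        return doc.DocumentNode.InnerHtml;
    }
}
```

Note that this strips every attribute, so links and images are out unless you add per-attribute checks (URL schemes in particular) on top of it.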
Upvotes: 2