sasfrog

Reputation: 2460

Minimize (compress; deflate) html for database storage: is it necessary?

I am storing the HTML from the body of emails in a SQL Server nvarchar(max) column. Is there any benefit in minimizing the HTML on the way in?

By minimizing I mean removing redundant white space and carriage returns/linefeeds in the HTML text stream. My terminology might not be quite right: I'm not looking at removing any HTML tags/comments or anything like that.

By benefit I mean efficiency of storage space and speed of insert/retrieval, so the benefits I care about are on the database side.

If it is worthwhile to do, what should I look out for (e.g. if I replace linefeeds with a single space, might it render the HTML incorrectly at a later time)?

Upvotes: 1

Views: 1013

Answers (2)

gbn

Reputation: 432301

The HTML will just be stored as a BLOB in the database. You won't be able to parse it, search it, etc. (well, you technically can, but that's silly). In that case, you can compress and decompress it in the client and store it as varbinary(max) in the database.

The trade-off is CPU time spent on compression versus increased storage and network traffic.
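As a rough illustration of client-side compression, here is a minimal Python sketch using the standard `zlib` module. The function names `compress_html` and `decompress_html` and the sample markup are made up for this example, and the database call itself is left out; the resulting `bytes` value is what you would bind to the varbinary(max) parameter.

```python
import zlib


def compress_html(html: str) -> bytes:
    """Hypothetical helper: encode the HTML as UTF-8 and deflate it."""
    return zlib.compress(html.encode("utf-8"), 6)


def decompress_html(blob: bytes) -> str:
    """Hypothetical helper: inflate the stored bytes back into the original text."""
    return zlib.decompress(blob).decode("utf-8")


# Made-up sample with the kind of repeated whitespace found in email bodies.
sample = (
    "<html>\n  <body>\n"
    + "    <p>Hello,   world</p>\n" * 200
    + "  </body>\n</html>\n"
)

blob = compress_html(sample)

# nvarchar stores text as UTF-16, so compare against that size.
nvarchar_bytes = len(sample.encode("utf-16-le"))
print(f"nvarchar size: {nvarchar_bytes} bytes, compressed: {len(blob)} bytes")

# The round trip is lossless: the original whitespace comes back untouched.
restored = decompress_html(blob)
```

Because the compression is lossless, this gets the storage saving without touching the content at all. How much you actually save depends on your real emails, so measure on a sample before committing to it.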

I wouldn't sanitise the HTML because you'll lose readability and possibly original content.

Upvotes: 1

synthesizerpatel

Reputation: 28036

You'd still need a full HTML parser to understand what's HTML and what's not. Most browsers do a bit of 'fixing up' to make otherwise unpresentable HTML renderable, in ways that would be impossible to reproduce without fully parsing the tree.

Someone could stick in some bad HTML that would goof up your 'simple' parser pretty easily, more often by mistake than by malice. Don't get into the business of fixing HTML; handle it verbatim and let the bad content hang itself.
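To show how a 'simple' approach can go wrong even on valid HTML, here is a small Python sketch. The function `naive_minimize` and the sample markup are invented for illustration; it collapses every run of whitespace to a single space, which is roughly what the question describes, and in doing so it flattens the contents of a `<pre>` block where whitespace is significant.

```python
import re


def naive_minimize(html: str) -> str:
    """Hypothetical 'simple' minimizer: collapse all whitespace runs to one space."""
    return re.sub(r"\s+", " ", html).strip()


# Made-up email fragment with a preformatted, column-aligned table.
original = "<p>Totals:</p>\n<pre>item    qty\nwidget    3\n</pre>"

minimized = naive_minimize(original)
print(minimized)

# The line breaks and column alignment inside <pre> are gone, so the
# stored HTML no longer renders the way the sender wrote it.
pre_changed = "item    qty\nwidget    3" not in minimized
```

Avoiding this means knowing which whitespace sits inside `<pre>`, `<textarea>`, or elements styled with `white-space: pre`, and that is exactly the full-parser problem described above.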

Upvotes: 1
