ironfist

Reputation: 947

.Net: saving compressed files to SQL Server?

We have lots of .Net applications that upload files to and retrieve them from SQL Server (2008). We store the files as varbinary(max) in most places.

This is a common practice, of course, but I wonder if anyone goes the extra mile and compresses their files (e.g. zip) before saving, and decompresses (unzips) them when retrieving? Some of our databases get up into the 40-100 GB range in terms of .mdf file size, and I know that could be trimmed a bit.

Is this a good practice? Anyone know about performance impacts, or have any sample code?

Thanks Tom

Upvotes: 2

Views: 2776

Answers (3)

Solomon Rutzky

Reputation: 48934

First, we need to look at the additional details revealed in comments on the question and answers:

  1. The need to save space is due to the high cost of enterprise storage
  2. File types are: "mostly pdf's, word docs, excel files... e.g. office files. but also some big engineering files like cad's"
  3. There are thousands of files
  4. Many of the files are 1-3 MB

I tested PDF files from 1 MB to 17 MB and got at best 25% compression, while most seem to be around 10% - 15%. In terms of Microsoft Office files, it makes a rather large difference if you are talking about the pre-Office 2007 format (i.e. extensions not ending in "x") or the newer format (i.e. extensions ending in "x"). The newer format (e.g. ".docx", ".xlsx", etc) is already a compressed file so you won't get much of any savings there (to see for yourself: copy/paste any of the office documents with an extension ending in "x", rename the copy to have an extension of "zip" instead of "docx" or whichever it was, and double-click on it). I don't recall how much CAD drawings compress, but I suspect at least as much as the PDFs if not more.

So, assuming 2000 files at 3 MB each that is 6 GB. 10% average compression will save you 600 MB. It would be wise to do some sampling of your own on your files, especially the CAD files, to get a clearer picture of what you will really save. And knowing how much you realistically expect to save should help determine which of the following options makes the most sense given available time / resources.
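
For the sampling, a quick-and-dirty console app along the following lines can report per-file and overall ratios. This is only a rough sketch: it assumes GZip is representative of whatever compression you end up using, and the folder path is just a placeholder.

    // Rough sampling sketch: reports per-file and overall GZip compression ratios.
    using System;
    using System.IO;
    using System.IO.Compression;

    class CompressionSampler
    {
        static void Main()
        {
            long totalOriginal = 0, totalCompressed = 0;

            // Point this at a folder holding a representative sample of your files.
            foreach (string file in Directory.GetFiles(@"C:\SampleFiles"))
            {
                byte[] original = File.ReadAllBytes(file);

                using (var output = new MemoryStream())
                {
                    // leaveOpen: true so output.Length is still readable after the
                    // GZipStream is disposed (disposing writes the GZip footer).
                    using (var gzip = new GZipStream(output, CompressionMode.Compress, true))
                    {
                        gzip.Write(original, 0, original.Length);
                    }

                    totalOriginal += original.Length;
                    totalCompressed += output.Length;

                    Console.WriteLine("{0}: {1:N0} -> {2:N0} bytes ({3:P1} saved)",
                        Path.GetFileName(file), original.Length, output.Length,
                        1 - (double)output.Length / original.Length);
                }
            }

            Console.WriteLine("Overall savings: {0:P1}",
                1 - (double)totalCompressed / totalOriginal);
        }
    }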

Your options seem to be:

  1. Compression: This would seem to be the least amount of effort in terms of refactoring your existing code. Compressing and decompressing the files on the web server should be rather trivial (the .Net framework comes with libraries to do Deflate/Inflate as well as Gzip/Ungzip; a rough sketch of such helpers follows this list), and there should be no changes needed at the DB layer (you will just need to do a one-time migration of existing data to a compressed format). In a comment on the question, Jon Skeet asked if this effort would be worth it. Given the low amount of effort for just this option, I would think that is a clear Yes. At $50/hr (or so) for developer time and a project like this taking maybe 10 hours inclusive of QA, that is $500 of already-paid-for (budget-wise) employee time. Getting new SAN space will cost much more than $500 (more than $1000, even, if for some reason it takes 20 hours to implement) and usually comes from another budget and needs a requisition, etc.

    And there are benefits beyond the immediate decrease in MDF file size, namely:

    1. All data read from and written to disk goes through the Buffer Pool. The more 8k data pages it takes to read and write these files, the more often your real app data gets pushed out and needs to be read again from disk (which is slow; that is the whole reason we have read/write caches!). This decreases your Page Life Expectancy.
    2. Backups are smaller! The smaller the data, the smaller the backup files. And the smaller the data to backup and restore, the faster both operations are.
    3. Faster access (more often than not). Disk is the slowest part of the system, so if you can access the disk less, there is typically a net gain: the CPU can generally compress/decompress the data faster than the disk can read/write it. Of course, only testing on your system with your data and files will tell, but there is a good reason that Microsoft included built-in data compression at the ROW or PAGE level for indexes starting in SQL Server 2008.
  2. FILESTREAM: http://technet.microsoft.com/en-us/library/bb933993(v=sql.100).aspx This would take a little more effort but moves the files out of the MDF files. If the local file share is also on the SAN, you can consider combining the compression of the files along with using FILESTREAM to get them out of the MDF file.

  3. Remote Blob Storage (RBS): http://msdn.microsoft.com/en-us/library/gg316768.aspx This is an add-on feature that you need to download separately but seems to be designed for exactly this scenario (store external files on cheaper storage). Unlike the above two options, this one might not back up the external files when you back up the database and would require a secondary backup (certainly not impossible). There is a white paper linked at the bottom of that main RBS page that (starting at the bottom of page 40) explains the backup options. The MSDN documentation shows the feature starting with SQL Server 2008 R2, but I don't see any indication that it wouldn't work in SQL Server 2008.
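
To flesh out the compression option (#1 above), here is a rough sketch of what the compress/decompress helpers on the web server could look like, assuming you standardize on Gzip from System.IO.Compression; the class and method names are just illustrative:

    using System.IO;
    using System.IO.Compression;

    public static class GZipHelper
    {
        // Compress the raw file bytes before assigning them to the
        // varbinary(max) parameter of your INSERT/UPDATE command.
        public static byte[] Compress(byte[] data)
        {
            using (var output = new MemoryStream())
            {
                using (var gzip = new GZipStream(output, CompressionMode.Compress))
                {
                    gzip.Write(data, 0, data.Length);
                }   // disposing the GZipStream flushes the GZip footer
                return output.ToArray();   // ToArray still works on a closed MemoryStream
            }
        }

        // Decompress the bytes read back from the varbinary(max) column.
        public static byte[] Decompress(byte[] compressed)
        {
            using (var input = new MemoryStream(compressed))
            using (var gzip = new GZipStream(input, CompressionMode.Decompress))
            using (var output = new MemoryStream())
            {
                gzip.CopyTo(output);   // .NET 4.0+; use a manual read loop on 3.5
                return output.ToArray();
            }
        }
    }

Uploading then just calls Compress on the file bytes before assigning them to the varbinary(max) parameter, and retrieval calls Decompress on what comes back, so the surrounding data-access code stays the same.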

EDIT:
Some additional notes:

  1. One very nice benefit of the FILESTREAM option, as noted in the documentation, is that it skips the Buffer Pool, so the size of the data is a non-issue with regards to performance (i.e. Page Life Expectancy).

  2. If doing compression (which seems like a good idea regardless of any other options, as it can be used in combination with any of them), then choosing a more general method is probably best. Meaning, of the .Net framework options--Inflate/Deflate or Gzip/Ungzip--Gzip/Ungzip is probably the better choice because it is easier to find tools that deal with that format.

  3. If storing compressed data in the DB, whether for the one-time data migration or just to have access to the "raw" data at the DB layer in general (without being required to move the data to the app layer just to get at the "real" data), you can implement SQL CLR functions for Gzip / Ungzip. If unsure how to accomplish this, there are numerous examples across various sites (and a bare-bones sketch follows the example below), or you can just download and use the [Util_GZip] and [Util_GUnzip] functions (or [Util_Deflate] and [Util_Inflate]) in SQL# (of which I am the author, but these functions are available in the Free version). On a simplistic level, the one-time data migration could be accomplished by doing something similar to:

    UPDATE tbl
    SET    tbl.VarBinaryField = SQL#.Util_GZip(tbl.VarBinaryField)
    FROM   SchemaName.TableName tbl
    WHERE  tbl.ID BETWEEN @StartID AND @EndID -- do small batches at a time
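
And for anyone preferring to roll their own CLR functions instead of using SQL#, the compress side boils down to wrapping the same GZipStream logic in a [SqlFunction]. This is only a bare-bones sketch with a made-up function name (the decompress side mirrors it with CompressionMode.Decompress):

    using System.Data.SqlTypes;
    using System.IO;
    using System.IO.Compression;
    using Microsoft.SqlServer.Server;

    public class GZipClr
    {
        [SqlFunction(IsDeterministic = true, IsPrecise = true)]
        public static SqlBytes GZipCompress(SqlBytes input)
        {
            if (input.IsNull)
            {
                return SqlBytes.Null;
            }

            using (var output = new MemoryStream())
            {
                using (var gzip = new GZipStream(output, CompressionMode.Compress))
                {
                    gzip.Write(input.Value, 0, (int)input.Length);
                }
                return new SqlBytes(output.ToArray());
            }
        }
    }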
    

Upvotes: 4

Joshua Shearer

Reputation: 1120

Is this a good practice?

It's a pretty widely debated topic. Another answer to this question gives you a more in-depth backstory on it, so I won't duplicate that here.

Anyone know about performance impacts

When it comes to the performance of what you're suggesting, there's not a single answer that works for everyone. It depends on a few things:

  • How frequently will you be inserting/retrieving records from this table?
  • Where will the compression/decompression be performed? On the server, or on the client?
  • If the compression is performed on the client, is the client a web server or is it an end-user PC?

Different answers to the above questions will result in vastly different changes to performance.

If you're doing the compression on an end-user PC, then you'll potentially notice some benefits; if the data can be compressed very well (and quickly enough), then sending the data to the database might take less time than sending the uncompressed version. Though, if the data can't be compressed very well (or compresses very slowly), then your end-users might complain about a decrease in performance; it may take less time to send it to the server, but the only thing your end-users will notice is the loading bar while the data is being compressed. You might be able to get around this by conditionally compressing files that are known to compress very well, such as text documents.
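
As a rough illustration of that kind of conditional check, assuming the decision is based purely on file extension (the extension list here is just an example; a real implementation might also test-compress a small sample of the file):

    using System;
    using System.Collections.Generic;
    using System.IO;

    public static class CompressionPolicy
    {
        // Formats that are already internally compressed gain little from another pass.
        private static readonly HashSet<string> AlreadyCompressed =
            new HashSet<string>(StringComparer.OrdinalIgnoreCase)
            {
                ".zip", ".docx", ".xlsx", ".pptx", ".jpg", ".png", ".mp3", ".mp4"
            };

        public static bool ShouldCompress(string fileName)
        {
            return !AlreadyCompressed.Contains(Path.GetExtension(fileName));
        }
    }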

If the compression is being performed on a web server, which then writes the data to the database, you likely won't see a lot of benefit in terms of speed. Servers are usually connected to each other over very fast connections (usually 100/1000 Mbit if they're in the same data center), and you will have already incurred the most likely bottleneck: the upload speed of the user's internet connection.

At this point you're just putting more load on your web server that could potentially be better spent servicing a greater number of concurrent users of your web application. Of course, you could always upload the files to a staging directory and perform the compression at off-peak hours, but then you've added a lot of complexity (what if the file is requested again before it is compressed and sent to the database?) just to save a few megabytes on your server.

Furthermore, you're going to incur a similar performance cost every time a file is requested, since you'll have to spend time and processing power decompressing it. If you get many requests to download files in a short period of time, your server could slow to a crawl trying to decompress everyone's files before sending them down the wire.

As I said in the beginning, there's no single answer to this that will work for everyone, but if you consider all the factors, you can make an informed decision about what will work best for your environment.

Upvotes: 1

alstonp

Reputation: 700

Is it a good practice? Not really. A better solution is to store the file on the filesystem and hold a pointer to that file in your database.

When talking to relational purists, this question is a sore point. From their perspective, no unstructured data should be stored in a database. Microsoft has addressed this issue with FILESTREAM storage. In short, it stores the file on the file system as opposed to in the mdf files, and keeps a pointer to the file which then allows the server to retrieve, replace, delete, etc. the file.

Take a look at the MS documentation here

FILESTREAM integrates the SQL Server Database Engine with an NTFS file system by storing varbinary(max) binary large object (BLOB) data as files on the file system. Transact-SQL statements can insert, update, query, search, and back up FILESTREAM data. Win32 file system interfaces provide streaming access to the data.
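
From the .Net side, that streaming access looks roughly like the sketch below; the table, column, and key names are made up for illustration, and SqlFileStream requires integrated security plus an open transaction to obtain the transaction context:

    // Rough sketch of reading a FILESTREAM value through the Win32 streaming API.
    using System;
    using System.Data;
    using System.Data.SqlClient;
    using System.Data.SqlTypes;
    using System.IO;

    public static class FileStreamReader
    {
        public static byte[] ReadDocument(string connectionString, int documentId)
        {
            using (var conn = new SqlConnection(connectionString))
            {
                conn.Open();
                using (var tran = conn.BeginTransaction())
                {
                    string path;
                    byte[] txContext;

                    using (var cmd = new SqlCommand(
                        @"SELECT FileData.PathName(), GET_FILESTREAM_TRANSACTION_CONTEXT()
                          FROM dbo.Documents WHERE DocumentID = @id;", conn, tran))
                    {
                        cmd.Parameters.AddWithValue("@id", documentId);
                        using (var reader = cmd.ExecuteReader())
                        {
                            if (!reader.Read())
                                throw new InvalidOperationException("Document not found.");
                            path = reader.GetString(0);
                            txContext = (byte[])reader[1];
                        }
                    }

                    // The actual read goes through the file system, not the TDS connection.
                    using (var stream = new SqlFileStream(path, txContext, FileAccess.Read))
                    using (var buffer = new MemoryStream())
                    {
                        stream.CopyTo(buffer);   // .NET 4.0+
                        tran.Commit();
                        return buffer.ToArray();
                    }
                }
            }
        }
    }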

Upvotes: 3
