Jace
Jace

Reputation: 817

Is it possible to query GZIP document stored as Binary data in SQL Server?

I have about thirty-thousand Binary records, all compressed using GZIP, and I need to search the contents of each document for a specified keyword. Currently, I am downloading and extracting all documents at startup. This works well enough, but I expect to be adding another ten-thousand each year. Ideally, I would like to perform a SELECT statement on the Binary column itself, but I have no idea how to go about it, or if this is even possible. I would like to perform this transaction with the least possible amount of data leaving the server. Any help would be appreciated.

EDIT: The Sql records are not compressed. What I mean to say is that I'm compressing the data locally and uploading the compressed files to a SQL Server column of Binary data type. I'm looking for a way to query that compressed data without downloading and decompressing each and every document. The data was stored this way to minimize overhead and reduce transfer cost, but the data must also be queried. It looks like I may have to store two versions of the data on the server, one compressed to be downloaded by the user, and one decompressed to allow search operations to be performed. Is there a more efficient approach?

Upvotes: 1

Views: 1232

Answers (2)

Roland
Roland

Reputation: 5232

Why compressing 30,000 or 40,000 records? Does not sound like a whole lot of data, of course depending of the average size of a record.

For keyword searching, you should not compress the database records. But to save on disk space, in most operating systems, it is possible to compress data on the file level, without the SQL Server even noticing.


update:

As Vladimir pointed out, SQL Server does not run on a compressed file system. Then you could store that data in TWO columns: once uncompressed, for keyword searching, and once compressed, for improved data transfer.

Storing data in a separate searchable column is not uncommon. For example, if you want to search on a combination of fields, you might as well store that combination in a search column, so that you could index that column to accelerate searching. In your case, you might store the data in the search column all lower cased, and with accented characters converted to ascii, and add an index, to accelerate case-insensitive searching on ascii keywords.

In fact, Vladimir already suggested this.

Upvotes: 0

Vladimir Baranov
Vladimir Baranov

Reputation: 32693

SQL Server has a Full-Text Search feature. It will not work on the data that you compressed in your application, of course. You have to store it in plain-text in the database. But, it is designed specifically for this kind of search, so performance should be good.

SQL Server can also compress the data in rows or in pages, but this feature is not available in every edition of SQL Server. For more information, see Features Supported by the Editions of SQL Server. You have to measure the impact of compression on your queries.

Another possibility is to write your own CLR functions that would work on the server - load the compressed binary column, decompress it and do the search. Most likely performance would be worse than using the built-in features.


Taking your updated question into account.

I think your idea to store two versions of the data is good.

Store compressed binary data for efficient transfer to and from the server.

Store secondary copy of the data in an uncompressed format with proper indexes (consider full-text indexes) for efficient search by keywords.

Consider using CLR function to help during inserts. You can transfer only compressed data to the server, then call CLR function that would decompress it on the server and populate the secondary table with uncompressed data and indexes.

Thus you'll have both efficient storage/retrieval plus efficient searches at the expense of the extra storage on the server. You can think of that extra storage as an extra structure for the index that helps with searches.

Upvotes: 2

Related Questions