Ian Boyd
Ian Boyd

Reputation: 257103

Using Full-Text indexing to crawl binary blobs

If i store binary files (e.g. doc, html, xml, xps, docx, pdf) inside a varbinary(max) column in SQL Server, how can i use Full-Text indexing to crawl the binary files?

Imagine i create a table to store binary files:

CREATE TABLE Documents (
    DocumentID int IDENTITY,
    Filename nvarchar(32000),
    Data varbinary(max),
)

How can i leverage the IFilter system provided by Windows to crawl these binary files and extract useful, searchable, information?

The motivation for this, of course, is that Microsoft's Indexing Service is derprecated, and replaced with Windows Search. Indexing Service provided an OLEDB provider (MSIDX that SQL Server could use to query the Indexing Service catalog. The Indexing Service OLE DB Provider

Windows Search, on the other hand has no way to query the catalog. There is no way for SQL Server to access Windows Search.

Fortunately, the capabilities of Windows Search (and Indexing Service before it) were brought into SQL Server proper. The SQL Server Full-Text indexing service uses the same IFilter mechanism that has been around for 19 years.

The question is: how to use it to crawl blobs stored in the database.

Upvotes: 2

Views: 2941

Answers (1)

Ian Boyd
Ian Boyd

Reputation: 257103

SQL Server fulltext can index varbinary and image columns.

You can see the list of all file types currently supported by SQL Server:

SELECT * FROM sys.fulltext_document_types

For example:

| document_type | class_id                             | path                                                                             | version           | manufacturer          |
|---------------|--------------------------------------|----------------------------------------------------------------------------------|-------------------|-----------------------|
| .doc          | F07F3920-7B8C-11CF-9BE8-00AA004B9986 | C:\Windows\system32\offfilt.dll                                                  | 2008.0.9200.16384 | Microsoft Corporation |
| .txt          | C7310720-AC80-11D1-8DF3-00C04FB6EF4F | c:\Program Files\Microsoft SQL Server\MSSQL11.MSSQLSERVER\MSSQL\Binn\msfte.dll   | 12.0.6828.0       | Microsoft Corporation |
| .xls          | F07F3920-7B8C-11CF-9BE8-00AA004B9986 | C:\Windows\system32\offfilt.dll                                                  | 2008.0.9200.16384 | Microsoft Corporation |
| .xml          | 41B9BE05-B3AF-460C-BF0B-2CDD44A093B1 | c:\Program Files\Microsoft SQL Server\MSSQL11.MSSQLSERVER\MSSQL\Binn\xmlfilt.dll | 12.0.9735.0       | Microsoft Corporation |

When creating the varbinary (or image) column to contain your binary file, you must have another string column that gives the file type through its extension (e.g. ".doc")

CREATE TABLE Documents (
    DocumentID int IDENTITY,
    Filename nvarchar(32000),
    Data varbinary(max),
    DataType varchar(50) --contains the file extension (e.g. ".docx", ".pdf")
)

When adding the binary column to the full-text index SQL Server needs you to tell it which column contains the data type string:

ALTER FULLTEXT INDEX ON [dbo].[Documents] 
ADD ([Data] TYPE COLUMN [DataType])

You can test by importing a binary file from the filesystem on the server:

INSERT INTO Documents(filename, DataType, data) 
SELECT 
   'Managing Storage Spaces with PowerShell.doc' AS Filename, 
   '.doc', * 
FROM OPENROWSET(BULK N'C:\Managing Storage Spaces with PowerShell.doc', SINGLE_BLOB) AS Data

You can view the catalog status using:

DECLARE @CatalogName varchar(50);
SET @CatalogName = 'Scratch';

SELECT
    CASE FULLTEXTCATALOGPROPERTY(@CatalogName, 'PopulateStatus')
    WHEN 0 THEN 'Idle'
    WHEN 1 THEN 'Full population in progress'
    WHEN 2 THEN 'Paused'
    WHEN 3 THEN 'Throttled'
    WHEN 4 THEN 'Recovering'
    WHEN 5 THEN 'Shutdown'
    WHEN 6 THEN 'Incremental population in progress'
    WHEN 7 THEN 'Building index'
    WHEN 8 THEN 'Disk is full. Paused.'
    WHEN 9 THEN 'Change tracking'
    ELSE 'Unknown'
    END+' ('+CAST(FULLTEXTCATALOGPROPERTY(@CatalogName, 'PopulateStatus') AS varchar(50))+')' AS PopulateStatus,
    FULLTEXTCATALOGPROPERTY(@CatalogName, 'ItemCount') AS ItemCount,
    CAST(FULLTEXTCATALOGPROPERTY(@CatalogName, 'IndexSize') AS varchar(50))+ ' MiB' AS IndexSize,
    CAST(FULLTEXTCATALOGPROPERTY(@CatalogName, 'UniqueKeyCount') AS varchar(50))+' words' AS UniqueKeyCount,
    FULLTEXTCATALOGPROPERTY(@CatalogName, 'PopulateCompletionAge') AS PopulateCompletionAge,
    DATEADD(second, FULLTEXTCATALOGPROPERTY(@CatalogName, 'PopulateCompletionAGe'), 0) AS PopulateCompletionDate

And you can query the catalog:

SELECT * FROM Documents
WHERE FREETEXT(Data, 'Bruce')

Additional IFilters

SQL Server has a limited set of built-in filters. It can also use IFilter implementations registered on the system (e.g. Microsoft Office 2010 Filter Pack that provides docx, msg, one, pub, vsx, xlsx and zip support).

You must enable the use of the OS-level filters by enabling the option:

sp_fulltext_service 'load_os_resources', 1

and restart the SQL Server service.

load_os_resources int

Indicates whether operating system word breakers, stemmers, and filters are registered and used with this instance of SQL Server. One of:

0: Use only filters and word breakers specific to this instance of SQL Server.
1: Load operating system filters and word breakers.

By default, this property is disabled to prevent inadvertent behavior changes by updates made to the operating system. Enabling use of operating system resources provides access to resources for languages and document types registered with Microsoft Indexing Service that do not have an instance-specific resource installed. If you enable the loading of operating system resources, ensure that the operating system resources are trusted signed binaries; otherwise, they cannot be loaded when verify_signature is set to 1.

If using SQL Server before SQL Server 2008, you must also restart the Full-Text indexing service after enabling this option:

net stop msftesql
net start msftesql

Microsoft provides filter packs contain IFilter for the Office 2007 file types:

And Adobe provides an IFilter for indexing PDFs (Foxit provides one, but theirs is not free):

Bonus Reading

Upvotes: 3

Related Questions