Reputation: 3656
I'm working on a knowledge base project using SQL Server 2008 Full Text Search Engine. Project included in articles and files where each article has multiple files.In those articles whole content is pure html.
Right now,I successfully created fulltext catalog and index on SQL Server 2008 and my database is version 10 compatible.
Here are my questions:
1)Is it possible to ignore html tags,more clearly texts containing in "<...>", while searching in these articles,because if i wish to search for div,table etc. there should be no result returned?
2)Articles will be updated anytime,so full text index must be updated when a new record is inserted.Is it enough to set only "TRACK CHANGES AUTOMATIC" while creating full text catalog?
3)We may use FILESTREAM feature hereafter,does SQL Server 2008 have a good performance on files using full text index? What specific document types does SQL Server 2008 good on indexing?
Regards
Upvotes: 8
Views: 7027
Reputation: 393
there is a filter for .htm and .html files.
to see if you have the filter installed run this sql:
SELECT * FROM sys.fulltext_document_types
you should see:
.htm E0CA5340-4534-11CF-B952-00AA0051FE20 C:\Program Files\Microsoft SQL Server\MSSQL10.MSSQLSERVER\MSSQL\Binn\nlhtml.dll 12.0.6828.0 Microsoft Corporation
.html E0CA5340-4534-11CF-B952-00AA0051FE20 C:\Program Files\Microsoft SQL Server\MSSQL10.MSSQLSERVER\MSSQL\Binn\nlhtml.dll 12.0.6828.0 Microsoft Corporation
so, if you can convert your articles column to varbinary(max), then you can add a full text index on it and specify a doc type of '.html'
once the index has populated, you can verify the keywords using this sql:
SELECT display_term, column_id, document_count
FROM sys.dm_fts_index_keywords
(DB_ID('your_db'), OBJECT_ID('your_table'))
Upvotes: 28
Reputation: 2268
Please check for these:
1) In SQL Server Full Text, we can define noise words/Stopwords. You can edit the Noise world file and then you have to rebuild the catalog. So you can put all the html tags as noise. Please check
http://msdn.microsoft.com/en-us/library/ms142551.aspx
2) With track changes it automatically include the changes in current full text search, but the ranking of these newly added article gets changed from the previous. So until and unless you master index is synced it will give up and down with ranking.
3) As far as i know we can implement custom filters, stemmers and word breakers and can plug into SQL Server full text search.By default i may not know the complete list, but it does doc and pdf.
For more information on SQL Server full text search 2008 please check:
http://technet.microsoft.com/en-us/library/cc721269.aspx
Upvotes: -1