mshirdel
mshirdel

Reputation: 88

How to save documents like PDF,Docx,xls in sql server 2008

I develop a web application that let users to upload files like images and documents. this file divided into two parts :

  1. binary files
  2. document files

I want to allow users to search documents that uploaded. specialy using full text search. What data types I should use for these two file types?

Upvotes: 3

Views: 4749

Answers (3)

Baz1nga
Baz1nga

Reputation: 15579

You can store the data in binary and use full text search to interpret the binary data and extract the textual information: .doc, .txt, .xls, .ppt, .htm. The extracted text is indexed and becomes available for querying (make sure you use the CONTAINS keyword). Needless to say, full text search has to be enabled.Not sure how adding a full text index will affect your system - i.e., its size. You'll also need to look at the execution plan to ensure the index gets used at query time.

For more information look at this:

http://technet.microsoft.com/en-us/library/ms142499(SQL.90).aspx

Pros: The main advantage of storing data in the database is that it makes the data "self-contained". Since all of the data is contained within the database, backing up the data, moving the data from one database server to another, replicating the database, and so on, is much easier.

also you can enable versioning of files and also make it easier for load balanced web farms.

Cons: you can read it here: https://dba.stackexchange.com/questions/3924/sql-server-2005-large-binary-storage. But this is something that you have to do in order to search through the files efficiently.

Or the other thing that I could suggest is probably storing keywords in the database and then linking the same to file in the fileshare.

Here is an article discussing abt using a FileStream and a database: http://blogs.msdn.com/b/manisblog/archive/2007/10/21/filestream-data-type-sql-server-2008.aspx

Upvotes: 2

Pars.Engineer
Pars.Engineer

Reputation: 67

The full-text index can be created for columns which use any of the following data types – CHAR, NCHAR, VARCHAR, NVARCHAR, TEXT, NTEXT, VARBINARY, VARBINARY (MAX), IMAGE and XML.

In addition, To use full text search you must create a full-text index for the table against which they want to run full-text search queries. For a particular SQL Server Table or Indexed View you can create a maximum of one Full-Text Index.

these are two article about it:

SQL SERVER - 2008 - Creating Full Text Catalog and Full Text Search

Using Full Text Search in SQL Server 2008

Upvotes: 0

JustBeingHelpful
JustBeingHelpful

Reputation: 18980

You first need to convert the PDF to text. There are libraries for this sort of thing (ie: PowerGREP). Then I'd recommend storing the text of the PDF files in a database. If you need to do full text searching and logic such as "on the same line" then you'll need to store one record per line of text. If you just want to search for text in a file, then you can change the structure of your SQL schema to match your needs.

For docx files, I would convert them to RTF and search them that way while stored in SQL.

For images, Microsoft has a program called Microsoft OneNote that does OCR (optical character recognition) so you can search for text within images. It doesn't matter what tool you use, just that it supports OCR.

Essentially, if you don't have a way to directly read the binary file, then you need to convert it to text with some library, then worry about doing your searching.

Upvotes: 0

Related Questions