Reputation: 23
I've created a document archive management system on the .NET stack. Search capabilities are limited for now: when users search, they select the related customer and query the "title", "definition", or "date" fields. There are a lot of records (about 5 million), but I don't have any problem searching these fields. (BTW, the database is SQL Server.)
We attach PDF or Office files to records. When a user attaches a file to a record, I save the file to the filesystem and write the file path to the database. If a record in the query results has attached file(s), the user can open the document by clicking the path.
We want to index these attached documents and search within that index, but I need to create an index per customer.
In summary, what I want to do:

- Extract the text from attached PDF/Office documents (and scanned images)
- Build a separate search index per customer
- Let users search the contents of the attachments, not just the database fields
I know there are technologies that can do this: Lucene/Solr, Sphinx, etc. But I'm confused and need some advice.
Upvotes: 1
Views: 1220
Reputation: 3209
It sounds like you're exploring options for:

- extracting text from PDF, Office, and image attachments
- fulltext search over that extracted content
- a per-customer (multi-tenant) index strategy
I can give you some pointers in the Apache stack (since you mentioned Lucene/Solr in your post):
Extract
A widely used open-source document extraction tool is Apache Tika (which can handle PDFs and MS Office documents, among others).
You mentioned you want to index images, which in a document management context probably includes OCR, yes? There is growing support for Google Tesseract OCR in the open-source community (Apache 2 licensed).
Apache Tika has recently added support for Tesseract OCR under the name TikaOCR.
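For illustration, here is a minimal C# sketch of the extraction step that shells out to the tika-app command-line jar (the jar location is an assumption; point it at wherever you install Tika):

```csharp
using System;
using System.Diagnostics;

static class TextExtractor
{
    // Assumption: Java is on PATH and tika-app.jar has been downloaded locally.
    const string TikaJar = @"C:\tools\tika-app.jar";

    public static string ExtractPlainText(string filePath)
    {
        var psi = new ProcessStartInfo
        {
            FileName = "java",
            Arguments = $"-jar \"{TikaJar}\" --text \"{filePath}\"",
            RedirectStandardOutput = true,
            UseShellExecute = false,
            CreateNoWindow = true
        };

        using (var process = Process.Start(psi))
        {
            // tika-app writes the extracted plain text to stdout.
            string text = process.StandardOutput.ReadToEnd();
            process.WaitForExit();
            return text;
        }
    }
}
```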
Search
You're already using SQL Server, so if the fulltext support meets your needs, it may be simplest initially just to use Tika to generate plain text that can be added as another column (with a FULLTEXT index) to your document table in SQL Server.
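A rough sketch of that approach (the table, column, and key names here are made up):

```csharp
using System;
using System.Data.SqlClient;

// One-time setup (T-SQL), assuming a Documents table with primary key PK_Documents:
//   ALTER TABLE dbo.Documents ADD ExtractedText NVARCHAR(MAX) NULL;
//   CREATE FULLTEXT CATALOG DocumentCatalog;
//   CREATE FULLTEXT INDEX ON dbo.Documents (ExtractedText)
//       KEY INDEX PK_Documents ON DocumentCatalog;

static class DocumentSearch
{
    public static void Search(string connectionString, int customerId, string term)
    {
        const string sql =
            @"SELECT Id, Title, FilePath
              FROM dbo.Documents
              WHERE CustomerId = @customerId
                AND CONTAINS(ExtractedText, @term)";

        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(sql, conn))
        {
            cmd.Parameters.AddWithValue("@customerId", customerId);
            // CONTAINS takes a search condition; quoting makes phrases safe.
            cmd.Parameters.AddWithValue("@term", $"\"{term}\"");
            conn.Open();
            using (var reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                    Console.WriteLine($"{reader["Title"]} -> {reader["FilePath"]}");
            }
        }
    }
}
```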
Incorporating a Lucene-based search server (Apache Solr, or Elasticsearch) will greatly enhance your ability to tune search and expose best-practice search features (autocomplete, search facets, similar search).
Lucene.NET is another solution (a C# library), but it hasn't kept pace with the Lucene Java project (last update 2012). Additionally, with 5MM+ documents, you'd be wise to consider an out-of-process, server-based search solution.
Multi-tenant strategy
Ultimately you have three main options:

1. Shared datastore, shared search index (one index for all clients, filtered by a client field)
2. Shared datastore, separate search index per client
3. Separate datastore (and search index) per client
You can implement any of the three approaches using SQL Server (your current implementation is #1, correct?).
With any search solution, #3 (separate datastores) may become operationally cost-prohibitive as your client list grows (unless there is a hard requirement to firewall off organizational data). So many multi-tenant search applications use #2 (shared datastore, separate search indexes) or #1 (shared datastore, shared index), depending on requirements.
Both Solr and Elasticsearch will enable you to set up one document index/collection per client (#2), or manage one big multi-client collection with, say, a client_id field for filtering (#1).
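To make #1 concrete, here is a hedged sketch of such a filtered query against Elasticsearch's HTTP API (the index and field names are assumptions, and the bool/filter syntax shown is ES 2.x-style):

```csharp
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

static class MultiTenantSearch
{
    // Every document is indexed with a client_id field; every query filters on it.
    public static async Task<string> SearchAsync(string clientId, string term)
    {
        // In real code, build this JSON with a serializer rather than concatenation.
        var query = @"{
          ""query"": {
            ""bool"": {
              ""must"":   { ""match"": { ""content"": """ + term + @""" } },
              ""filter"": { ""term"":  { ""client_id"": """ + clientId + @""" } }
            }
          }
        }";

        using (var http = new HttpClient())
        {
            var response = await http.PostAsync(
                "http://localhost:9200/documents/_search",
                new StringContent(query, Encoding.UTF8, "application/json"));
            return await response.Content.ReadAsStringAsync();
        }
    }
}
```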
With a commercial Elasticsearch plug-in (Shield) it is possible to provide index-level security, so that e.g. each client-facing .NET application could only access that client's document index (strategy #2 above).
Integration
You're working in .NET, and may not want to wrangle Java libraries. Both Solr and Elasticsearch operate as servers with HTTP APIs for search and ingest. Solr has an Apache Tika integration called Solr CELL, as does Elasticsearch via the elasticsearch-mapper-attachments plugin project (both of which would insulate you from Java development).
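For example, pushing a file into Solr via Solr CELL is just an HTTP upload. A rough sketch (the core name, parameters, and field names are assumptions):

```csharp
using System.IO;
using System.Net.Http;
using System.Threading.Tasks;

static class SolrCellIngest
{
    // Posts a file to Solr's ExtractingRequestHandler; Tika runs server-side,
    // so no Java code is needed in the .NET application.
    public static async Task IngestAsync(string filePath, string docId, string clientId)
    {
        var url = "http://localhost:8983/solr/documents/update/extract" +
                  $"?literal.id={docId}&literal.client_id={clientId}&commit=true";

        using (var http = new HttpClient())
        using (var content = new MultipartFormDataContent())
        {
            content.Add(new ByteArrayContent(File.ReadAllBytes(filePath)),
                        "file", Path.GetFileName(filePath));
            var response = await http.PostAsync(url, content);
            response.EnsureSuccessStatusCode();
        }
    }
}
```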
However, Elasticsearch's Tika integration does not yet support Tesseract OCR (Solr's integration does).
There are .NET clients for Elasticsearch (NEST is getting a lot of use).
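A taste of the fluent API (a sketch assuming NEST 2.x-style syntax and a hypothetical ArchivedDocument POCO, with one index per client per strategy #2):

```csharp
using System;
using Nest;

class ArchivedDocument
{
    public string Title { get; set; }
    public string FilePath { get; set; }
    public string Content { get; set; }  // plain text extracted at ingest time
}

static class Program
{
    static void Main()
    {
        var settings = new ConnectionSettings(new Uri("http://localhost:9200"))
            .DefaultIndex("client-42-documents");  // one index per client (#2)
        var client = new ElasticClient(settings);

        var result = client.Search<ArchivedDocument>(s => s
            .Query(q => q.Match(m => m.Field(f => f.Content).Query("invoice"))));

        foreach (var doc in result.Documents)
            Console.WriteLine($"{doc.Title} -> {doc.FilePath}");
    }
}
```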
Scaling Considerations
OCR processing and text extraction are CPU-intensive, so as your ingest volume grows, you may ultimately want to process documents on dedicated machines that are not used for search.
In summary, assuming you need OCR, any of these ingest/search stacks could work:

- Tika (with Tesseract) producing plain text into a FULLTEXT-indexed SQL Server column
- Solr with Solr CELL (Tika and Tesseract OCR run server-side)
- Elasticsearch with the mapper-attachments plugin, plus a separate OCR pre-processing step
I hope this gives you a starting point for your solution!
Upvotes: 3