Reputation: 23
I've created a document archive management system on the .NET stack. Search capabilities are limited for now: when users search, they select the related customer and query the "title", "definition", or "date" fields. There are a lot of records (about 5 million), but I don't have any problem searching these fields. (BTW, the database is SQL Server.)
We attach PDF or Office files to records. When a user attaches a file to a record, I save the file to the filesystem and write the file path to the database. If a record in the query results has attached file(s), the user can open the document by clicking the path.
We want to index these attached documents and search within that index, but I need to create an index per customer.
In summary, what I want to do:

- Extract the text from attached PDF/Office documents (and scanned images)
- Build a separate search index per customer
- Let users search the contents of the attachments, not just the database fields
I know there are technologies that can do this: Lucene/Solr, Sphinx, etc. But I'm confused and need some advice.
Upvotes: 1
Views: 1220
Reputation: 3209
It sounds like you're exploring options for:

- extracting text from PDF, Office, and image attachments
- fulltext search over that extracted content
- a per-customer (multi-tenant) index strategy
I can give you some pointers in the Apache stack (since you mentioned Lucene/Solr in your post):
Extract
A widely used open-source document extraction tool is Apache Tika (which can handle PDFs and MS Office documents, among others).
You mentioned you want to index images, which in a document management context probably includes OCR, yes? There is growing support for Google Tesseract OCR in the open-source community (Apache 2 licensed).
Apache Tika has recently added support for Tesseract OCR under the name TikaOCR.
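For illustration, here is a minimal C# sketch of the extraction step that shells out to the tika-app command-line jar (the jar location is an assumption; point it at wherever you install Tika):

```csharp
using System;
using System.Diagnostics;

static class TextExtractor
{
    // Assumption: Java is on PATH and tika-app.jar has been downloaded locally.
    const string TikaJar = @"C:\tools\tika-app.jar";

    public static string ExtractPlainText(string filePath)
    {
        var psi = new ProcessStartInfo
        {
            FileName = "java",
            Arguments = $"-jar \"{TikaJar}\" --text \"{filePath}\"",
            RedirectStandardOutput = true,
            UseShellExecute = false,
            CreateNoWindow = true
        };

        using (var process = Process.Start(psi))
        {
            // tika-app writes the extracted plain text to stdout.
            string text = process.StandardOutput.ReadToEnd();
            process.WaitForExit();
            return text;
        }
    }
}
```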
Search
You're already using SQL Server, so if the fulltext support meets your needs, it may be simplest initially just to use Tika to generate plain text that can be added as another column (with a FULLTEXT index) to your document table in SQL Server.
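A rough sketch of that approach (the table, column, and key names here are made up):

```csharp
using System;
using System.Data.SqlClient;

// One-time setup (T-SQL), assuming a Documents table with primary key PK_Documents:
//   ALTER TABLE dbo.Documents ADD ExtractedText NVARCHAR(MAX) NULL;
//   CREATE FULLTEXT CATALOG DocumentCatalog;
//   CREATE FULLTEXT INDEX ON dbo.Documents (ExtractedText)
//       KEY INDEX PK_Documents ON DocumentCatalog;

static class DocumentSearch
{
    public static void Search(string connectionString, int customerId, string term)
    {
        const string sql =
            @"SELECT Id, Title, FilePath
              FROM dbo.Documents
              WHERE CustomerId = @customerId
                AND CONTAINS(ExtractedText, @term)";

        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(sql, conn))
        {
            cmd.Parameters.AddWithValue("@customerId", customerId);
            // CONTAINS takes a search condition; quoting makes phrases safe.
            cmd.Parameters.AddWithValue("@term", $"\"{term}\"");
            conn.Open();
            using (var reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                    Console.WriteLine($"{reader["Title"]} -> {reader["FilePath"]}");
            }
        }
    }
}
```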
Incorporating a Lucene-based search server (Apache Solr, or Elasticsearch) will greatly enhance your ability to tune search and expose best-practice search features (autocomplete, search facets, similar search).
Lucene.NET is another solution (a C# library), but it hasn't kept pace with the Lucene Java project (last update 2012). Additionally, with 5MM+ documents, you'd be wise to consider an out-of-process, server-based search solution.
Multi-tenant strategy
Ultimately you have three main options:

1. Shared datastore, shared search index (one index for all clients, filtered by a client field)
2. Shared datastore, separate search index per client
3. Separate datastore (and search index) per client
You can implement any of the three approaches using SQL Server (your current implementation is #1, correct?).
With any search solution, #3 (separate datastores) may become operationally cost-prohibitive as your client list grows (unless there is a hard requirement to firewall off organizational data). So many multi-tenant search applications use #2 (shared datastore, separate search indexes) or #1 (shared datastore, shared index), depending on requirements.
Both Solr and Elasticsearch will enable you to set up one document index/collection per client (#2), or manage one big multi-client collection with, say, a client_id field for filtering (#1).
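To make #1 concrete, here is a hedged sketch of such a filtered query against Elasticsearch's HTTP API (the index and field names are assumptions, and the bool/filter syntax shown is ES 2.x-style):

```csharp
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

static class MultiTenantSearch
{
    // Every document is indexed with a client_id field; every query filters on it.
    public static async Task<string> SearchAsync(string clientId, string term)
    {
        // In real code, build this JSON with a serializer rather than concatenation.
        var query = @"{
          ""query"": {
            ""bool"": {
              ""must"":   { ""match"": { ""content"": """ + term + @""" } },
              ""filter"": { ""term"":  { ""client_id"": """ + clientId + @""" } }
            }
          }
        }";

        using (var http = new HttpClient())
        {
            var response = await http.PostAsync(
                "http://localhost:9200/documents/_search",
                new StringContent(query, Encoding.UTF8, "application/json"));
            return await response.Content.ReadAsStringAsync();
        }
    }
}
```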
With a commercial Elasticsearch plug-in (Shield) it is possible to provide index-level security, so that e.g. each client-facing .NET application could only access that client's document index (strategy #2 above).
Integration
You're working in .NET, and may not want to wrangle Java libraries. Both Solr and Elasticsearch operate as servers with HTTP APIs for search and ingest. Solr has an Apache Tika integration called Solr CELL, as does Elasticsearch via the elasticsearch-mapper-attachments plugin project (both of which would insulate you from Java development).
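For example, pushing a file into Solr via Solr CELL is just an HTTP upload. A rough sketch (the core name, parameters, and field names are assumptions):

```csharp
using System.IO;
using System.Net.Http;
using System.Threading.Tasks;

static class SolrCellIngest
{
    // Posts a file to Solr's ExtractingRequestHandler; Tika runs server-side,
    // so no Java code is needed in the .NET application.
    public static async Task IngestAsync(string filePath, string docId, string clientId)
    {
        var url = "http://localhost:8983/solr/documents/update/extract" +
                  $"?literal.id={docId}&literal.client_id={clientId}&commit=true";

        using (var http = new HttpClient())
        using (var content = new MultipartFormDataContent())
        {
            content.Add(new ByteArrayContent(File.ReadAllBytes(filePath)),
                        "file", Path.GetFileName(filePath));
            var response = await http.PostAsync(url, content);
            response.EnsureSuccessStatusCode();
        }
    }
}
```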
However, Elasticsearch's Tika integration does not yet support Tesseract OCR (Solr's integration does).
There are .NET clients for Elasticsearch (NEST is getting a lot of use).
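A taste of the fluent API (a sketch assuming NEST 2.x-style syntax and a hypothetical ArchivedDocument POCO, with one index per client per strategy #2):

```csharp
using System;
using Nest;

class ArchivedDocument
{
    public string Title { get; set; }
    public string FilePath { get; set; }
    public string Content { get; set; }  // plain text extracted at ingest time
}

static class Program
{
    static void Main()
    {
        var settings = new ConnectionSettings(new Uri("http://localhost:9200"))
            .DefaultIndex("client-42-documents");  // one index per client (#2)
        var client = new ElasticClient(settings);

        var result = client.Search<ArchivedDocument>(s => s
            .Query(q => q.Match(m => m.Field(f => f.Content).Query("invoice"))));

        foreach (var doc in result.Documents)
            Console.WriteLine($"{doc.Title} -> {doc.FilePath}");
    }
}
```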
Scaling Considerations
OCR processing and text extraction are CPU-intensive, so as your ingest volume grows, you may ultimately want to process documents on dedicated machines that are not used for search.
In summary, assuming you need OCR, any of these ingest/search stacks could work:

- Tika (with Tesseract) producing plain text into a FULLTEXT-indexed SQL Server column
- Solr with Solr CELL (Tika and Tesseract OCR run server-side)
- Elasticsearch with the mapper-attachments plugin, plus a separate OCR pre-processing step
I hope this gives you a starting point for your solution!
Upvotes: 3