Reputation: 8226
I am developing an advanced search engine using .Net where users can build their query based on several Fields:
I am using lucene to index Document Content and their Corresponding IDs. However, the other metadata resides in MS SQL DB (to avoid enlarging the index, and keep updating the index on any modification of the metadata).
How I can Perform the Search?
when any user search for a term:
AS you notice here, there is one lookup in DB, then Lucene, and Finally DB to get the values to be displayed in Grid.
Questions:
How can overcome this situation? I thought to begin searching lucene but this has a drawback if the Documents indexed reached 2 million. (i think narrowing down the results using the DB first have large effect on performance).
Another issue is passing IDs to lucene Search Service, how effective is passing hundred thousands of IDs? and what is the alternative solution?
I welcome any idea, so please share your thoughts.
Upvotes: 4
Views: 2619
Reputation: 3209
Your current solution incurs the following overhead at query-time:
1) Narrowing search space via MS-SQL
2) Executing bounded full-text search via Lucene.NET
3) Materializing result details via MS-SQL
There are two assumptions you may be making that would be worth reconsidering
A) Indexing all metadata (dates, author, location, etc...) will unacceptably increase the size of the index.
Try it out first: This is the best practice, and you'll massively reduce your query execution overhead by letting Lucene do all of the filtering for you in addition to text search.
Also, the size of your index has mostly to do with the cardinality of each field. For example, if you have only 500 unique owner names, then only those 500 strings will be stored, and each lucene document will internally reference their owner through a symbol-table lookup (4-byte integer * 2MM docs + 500 strings = < 8MB additional).
B) MS-SQL queries will be the quickest way to filter on non-text metadata.
So to summarize the best practice:
Index all of the data that you want to query or filter by. (No need to store source data since MS-SQL is your system-of-record).
Run filtered queries against Lucene (e.g. text AND date ranges, owner, location, etc...)
Return IDs
Materialize documents from MS-SQL using returned IDs.
I'd also recommend exploring a move to a standalone search server (Solr or Elasticsearch) for a number of reasons:
Upvotes: 10