Neil Barnwell
Neil Barnwell

Reputation: 42095

Using Lucene to query an RDBMS database

I've skimmed the docs for the Java version of Lucene, but I can't really see the top-level "this is how it works" info so far (I'm aware I need to RTFM, I just can't see the wood for the trees).

I understand Lucene uses search indexes to return results. As far as I know, it only returns "hits" from those indexes. If I haven't added an item of data when building the index then it won't be returned.

That's fine, so now I want to check the following assumption:

Q: Does that mean that any data I want displayed on a search page needs to be added to the Lucene index?

I.e.
If I want to search for Products by things like sku, description, category name, etc, but I also want to display the Customer they belong to in search results, do I:

  1. Make sure the Lucene index has the denormalised Customer's name in the index.
  2. Use the hits returned by Lucene to somehow query the database for the actual product records and use a JOIN to get the Customer's name.

I assume it's option 1, since I'm assuming there's no way to "join" the results of a Lucene query to an RDBMS, but wanted to ask it my assumptions about the general usage are correct.

Upvotes: 4

Views: 1749

Answers (3)

Arshad Ansari
Arshad Ansari

Reputation: 334

I have been trying to figure out the same problem, but I think that its too much work. I'm thinking of this as an alternative. Plse correct me if I'm wrong in my thinking!

Your situation is like this: RDBMS product (many) <------> (many) Customer

Instead of putting only customer in lucene index to get product keys, and then query RDBMS with IN Query, I'd suggest, create the lucene index with the cartesian product of Product as well as Customer.

Like customer_1, product_1 customer_1, product_2 customer_2, product_2..

This way, when you are searching for a product in lucene, it will give both the customer as well as the products id.. and instead of joining them in RDBMS, you can simply look up those customers as well as products for more information from RDBMS, if there is a need. If you are using caching, then the additional details lookup cost will also go down.

Upvotes: 1

Neil Barnwell
Neil Barnwell

Reputation: 42095

Based on BrokenGlass's answer, I've thought some more and am proposing the following to see if I'm on the right lines:

Basically, taking option 2 further, one could do the following:

  1. Put only the data you want to search on into the Lucene index, plus some sort of key value (e.g. the PK of a table in your database).
  2. Query Lucene to get a list of hits.
  3. Using your data access layer of choice, build a query for your database that includes an IN (value [, value]) predicate.
  4. Get the results for that query from your database (which may well include JOINs to other tables).
  5. Put those results in a dictionary, using the PK of the resultset as the key.
  6. Iterate the Lucene hits again in order, pulling the items from the dictionary using the PK so you can build a list of results in the order that Lucene returned the hits (i.e. sorted by relevance).
  7. Display that "sorted" list of results to the user.

Of course steps 5 and 6 could be better, but for the sake of explanation I put that verbose method in my description. If the Lucene hits include some sort of "relevance" value, then you could attribute that to the resultset and perform a standard sort, but that's an exercise for the reader. :)

Could this be it?

Upvotes: 0

BrokenGlass
BrokenGlass

Reputation: 160852

Usually the index would only contain the fields you want to search on, not necessarily the ones you want to display. Indexes should be optimized to be as small as possible, to keep search performance good.

To be able to display more data add a field to your index that allows you to retrieve your full document/data, i.e. a unique key for your Product (product id?).

Upvotes: 1

Related Questions