Reputation: 11

Using Lucene to index private data, should I have a separate index for each user or a single index

I am developing an Azure based website and I want to provide search capabilities using Lucene. (structured json objects would be indexed and stored in Lucene and other content such as Word documents, etc. would be indexed in lucene but stored in blob storage) I want the search to be secure, such that one user would never see a document belonging to another user. I want to allow ad-hoc searches as typed by the user. Lastly, I want to query programmatically to return predefined sets of data, such as "all notes for user X". I think I understand how to add properties to each document to achieve these 3 objectives. (I am listing them here so if anyone is kind enough to answer, they will have better idea of what I am trying to do)

My questions revolve around performance and security.

Can I improve document security by having a separate index for each user, or is including the user's ID as a parameter in each search sufficient?

Can I improve indexing speed and total throughput of the system by having a separate index for each user? My thinking is that having separate indexes would allow me to scale the system by having multiple index writers (perhaps even on different server instances) working at the same time, each on their own index.

Any insight would be greatly appreciated.

Regards,

Nate

Upvotes: 1

Answers (2)

Denis Bazhenov

Reputation: 9955

Until you will use several machines for indexing I guess it's more convenient to use single index. Lucene community done a lot of work to make indexing process as efficient as it can. So unless you intentionally want to implement distributed indexing I doesn't recommend you to split indexes.

However there are several reasons why you would want to split indexes:

if your machine have several IO devices which could be utilized in parallel. In this case, if you are IO bound, splitting indexes is good idea.
splitting document fields between indexes (this is what ParallelReader is supposed for). This is more exotic form of splitting, but it may be a good idea if search is performed using different groups of fields. Suppose, we have two search query types: the first is using field name and type, and the second is using fields price and discount. If those fields are updated at different rate (I guess, name updates are far more rarely than price updates), updating only part of index would require less IO resources. This will give more overall throughput to the system.

Upvotes: 0

Marko Bonaci

Reputation: 5706

Of course, one index. You can do even better than what you suggested by using ManifoldCF (Apache product that knows how to handle Solr) to manage security.

And one off topic, uninformed suggestion: I'd rather use CloudBees or Heroku (or Amazon) instead of Azure.

Upvotes: 2

Using Lucene to index private data, should I have a separate index for each user or a single index

Answers (2)

Related Questions