Reputation: 3659
I'm planning a database for a project that stores millions of documents about users and their machine logs.
The question is: how to store and shard this data? User-based or time-based?
Indexing by user, I can quickly query millions of docs for ONE user and generate many time-based reports about them.
Indexing by time, I can quickly query all users for ONE day and generate reports about them.
What is the best way to mine this data in both directions (user and time)?
I've been reading about sharding, indexing and routing.
Upvotes: 2
Views: 838
Reputation: 11744
There's no simple rule of thumb to follow, as I emphasize in an article I wrote on Sizing Elasticsearch. It discusses various approaches to sharding and partitioning and other things to keep in mind. Pros and cons with user-based routing and time range partitioning are both covered.
As you indicate in the comment, your ingestion rate isn't very high, so e.g. an index per day can work well. But whether that's a good idea depends a lot on your searches. Are you typically searching just the last few days, or will a user typically search their entire history? If the latter, then time-based partitioning might actually work against you, since you'll be searching across so many Lucene indexes.
The linked article references Shay's excellent talk on this topic as well: https://vimeo.com/44716955
Upvotes: 2
Reputation: 8733
How many docs a day will you be storing? You may be pre-optimizing.
One possible strategy (time-based indices with user routing):
If you make each day an index, you can limit any date-based search to only the indices that apply.
You could then route all docs by userid, so any user-based search would hit only the shards where data for that user exists.
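The strategy above can be sketched without a live cluster: build the list of daily index names covering a date range, and attach the user's id as the routing value so the search touches only that user's shard. This is a minimal sketch, not a definitive implementation; the `logs-YYYY.MM.DD` naming scheme, the `userid` field, and the helper names are assumptions for illustration, though passing `routing` on a search request is standard Elasticsearch behavior.

```python
from datetime import date, timedelta

def daily_indices(start, end, prefix="logs"):
    """Names of the daily indices covering [start, end], inclusive.

    Assumes an index-per-day scheme named like 'logs-2014.06.01'.
    """
    days = (end - start).days
    return [f"{prefix}-{start + timedelta(d):%Y.%m.%d}" for d in range(days + 1)]

def user_search_request(user_id, start, end):
    """Build the parameters for a user search over a date range.

    Limiting the index list restricts the search to the relevant days;
    the routing value restricts it to the one shard per index that
    holds this user's documents (assuming docs were indexed with the
    same routing value).
    """
    return {
        "index": ",".join(daily_indices(start, end)),
        "routing": user_id,
        "body": {"query": {"term": {"userid": user_id}}},
    }

req = user_search_request("user42", date(2014, 6, 1), date(2014, 6, 3))
# req["index"] covers exactly the three daily indices in the range,
# and req["routing"] pins the search to user42's shard in each.
```

With a real client you would pass these values straight through, e.g. the index list and `routing` parameter of the search API; the key point is that both pruning dimensions (time via index selection, user via routing) compose.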
Upvotes: 2