Reputation: 3659
I'm planning a database for a project that stores millions of documents about users and their machine logs.
The question is: how to store and shard this data? User-based or time-based?
Indexing by user, I can quickly query millions of docs for ONE user and generate many time-based reports about them.
Indexing by time, I can quickly query all users for ONE day and generate reports about them.
What is the best way to mine this data in both directions (user and time)?
I've been reading about sharding, indexing and routing.
Upvotes: 2
Views: 838
Reputation: 11744
There's no simple rule of thumb to follow, as I emphasize in an article I wrote on Sizing Elasticsearch. It discusses various approaches to sharding and partitioning and other things to keep in mind. Pros and cons with user-based routing and time range partitioning are both covered.
As you indicate in the comment, your ingestion rate isn't very high, so e.g. an index per day can work well. But whether that's a good idea depends a lot on your searches. Are you typically searching just the last few days, or will a user typically search their entire history? If the latter, then time-based partitioning might actually work against you, since you'll be searching across so many Lucene indexes.
The linked article references Shay's excellent talk on this topic as well: https://vimeo.com/44716955
Upvotes: 2
Reputation: 8733
How many docs a day will you be storing? You may be pre-optimizing.
One possible strategy (time-based indices with user routing):
If you make each day an index, you can limit any date-based search to only the indices that apply.
You could then route all docs by userid, so any user-based search would hit only the shards where data for that user exists.
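The strategy above can be sketched without a live cluster: build the list of daily index names covering a date range, and attach the user's id as the routing value so the search touches only that user's shard. This is a minimal sketch, not a definitive implementation; the `logs-YYYY.MM.DD` naming scheme, the `userid` field, and the helper names are assumptions for illustration, though passing `routing` on a search request is standard Elasticsearch behavior.

```python
from datetime import date, timedelta

def daily_indices(start, end, prefix="logs"):
    """Names of the daily indices covering [start, end], inclusive.

    Assumes an index-per-day scheme named like 'logs-2014.06.01'.
    """
    days = (end - start).days
    return [f"{prefix}-{start + timedelta(d):%Y.%m.%d}" for d in range(days + 1)]

def user_search_request(user_id, start, end):
    """Build the parameters for a user search over a date range.

    Limiting the index list restricts the search to the relevant days;
    the routing value restricts it to the one shard per index that
    holds this user's documents (assuming docs were indexed with the
    same routing value).
    """
    return {
        "index": ",".join(daily_indices(start, end)),
        "routing": user_id,
        "body": {"query": {"term": {"userid": user_id}}},
    }

req = user_search_request("user42", date(2014, 6, 1), date(2014, 6, 3))
# req["index"] covers exactly the three daily indices in the range,
# and req["routing"] pins the search to user42's shard in each.
```

With a real client you would pass these values straight through, e.g. the index list and `routing` parameter of the search API; the key point is that both pruning dimensions (time via index selection, user via routing) compose.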
Upvotes: 2