Reputation: 2807
I would like to understand the performance impact of indexing documents of multiple types to a single index where there is an imbalance in the number of items of each type (one type has millions, where another type has just thousands of documents). I have spotted issues on some of my indexes, and ruling out whether types are indexed separately within a single index (or not) would help me. Can I assume that types are indexed separately along the lines of a relational database where each table is effectively separate?
If the answer to the above is no and types are effectively all lumped together, then I'll lay out the rest of what I'm doing to try to get some more detailed input.
The use case for this example is capturing tweets for Twitter users (call it owner for clarity). I have a multi-tenant environment with one index per Twitter owner. That said, focusing on a single owner:
I have noticed a much slower response querying on the indexes with millions of documents, even when excluding the 'home timeline' type with millions of documents indexed, leaving just the types with a few thousand entries. I don't want to have to split the types into separate indexes (unless I have to), due to the parent-child relationship between a tweet and a user.
Is there a way I can understand if the issue is with the total number of documents in a specific index, something to do with the operation of 'has_child' filtered queries, some other poor design of queries or facets, or something else?
Any input would be appreciated.
EDIT
To clarify the statement that tweets are stored per timeline: there is an ElasticSearch type defined for each of home_timeline, my_tweets_timeline, mentions_timeline, direct_messages_timeline, etc., which correspond to what you see in the standard twitter.com UI. So there is a natural split between the sets of tweets, although with some overlap too.
I have gone back to check the has_child queries, and they are a definite red herring at this point. Basic queries on the larger indexes are much slower, even when querying a type with just a few thousand documents (my_tweets_timeline).
Upvotes: 1
Views: 1340
Reputation: 55233
Can I assume that types are indexed separately along the lines of a relational database where each table is effectively separate?
No, types are all lumped together into one index as you guessed.
Is there a way I can understand if the issue is with the total number of documents in a specific index, something to do with the operation of 'has_child' filtered queries, some other poor design of queries or facets, or something else?
The total number of documents in the index is obviously a factor. Whether `has_child` queries are slow in particular is another question. Try comparing the performance of `has_child` queries with trivial `term` queries, for example. The `has_child` documentation offers one clue under "memory considerations":
With the current implementation, all `_id` values are loaded to memory (heap) in order to support fast lookups, so make sure there is enough memory for it.
This would imply a large amount of memory is required for any `has_child` query where there are millions of potential children. Make sure enough memory is available for such operations, or consider a redesign that removes the need for `has_child`.
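One way to run the comparison suggested above is to build a trivial `term` query and a `has_child` query over the same field and time both against the same index. The following is only a rough sketch: the type name `tweet`, the field `text`, and the `run_search` callable are hypothetical placeholders, not taken from the question, and `run_search` stands for whatever function sends a request body to the index's `_search` endpoint in your setup.

```python
import statistics
import time
from typing import Callable, Dict


def term_query(field: str, value: str) -> Dict:
    """Baseline: a trivial term query on a single field."""
    return {"query": {"term": {field: value}}}


def has_child_query(child_type: str, field: str, value: str) -> Dict:
    """Match parents that have at least one child of `child_type`
    whose `field` contains the term `value`."""
    return {
        "query": {
            "has_child": {
                "type": child_type,
                "query": {"term": {field: value}},
            }
        }
    }


def median_latency_ms(run_search: Callable[[Dict], object],
                      body: Dict,
                      repeats: int = 5) -> float:
    """Call `run_search(body)` several times and return the median
    wall-clock latency in milliseconds.

    `run_search` is an assumption: any callable that executes the
    search against your cluster and returns when it completes.
    """
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_search(body)
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)


# Hypothetical usage, with `run_search` supplied by you:
#   baseline = median_latency_ms(run_search, term_query("text", "hello"))
#   child = median_latency_ms(run_search, has_child_query("tweet", "text", "hello"))
```

If the `term` query is already slow on the large index, the problem is probably not specific to `has_child`. If only the `has_child` query is slow, the memory consideration quoted above is the more likely cause. The first call of each query may include cache warm-up, which is why the sketch takes a median over several runs.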
Upvotes: 1