ElasticSearch types and indexing performance

Question

I would like to understand the performance impact of indexing documents of multiple types into a single index where there is an imbalance in the number of documents of each type (one type has millions, while another has just thousands). I have spotted performance issues on some of my indexes, and knowing whether types are indexed separately within a single index would help me narrow down the cause. Can I assume that types are indexed separately, along the lines of a relational database where each table is effectively separate?

If the answer to the above is no, and types are effectively all lumped together, then I'll lay out the rest of what I'm doing to try to get some more detailed input.

The use case for this example is capturing tweets for Twitter users (call each one an owner for clarity). I have a multi-tenant environment with one index per Twitter owner. That said, focusing on a single owner:

I capture the tweets from each timeline (mentions, direct messages, my tweets, and the full 'home' timeline) into a single index, with each timeline type having a different mapping in ElasticSearch.
Each tweet refers to a parent type, the user who authored the tweet (who may or may not be the owner), via a parent mapping. There is only a single 'user' type for all the timeline types.
I only ever search and facet on one owner per query, so I don't have to concern myself with searching across multiple indexes.
The home timeline may capture millions of tweets, whereas the owner's own tweets may number only in the hundreds or thousands.
The user documents are routinely updated with information from outside the Twitter timelines, so I would like to avoid (if possible) having to keep multiple copies of the same user object in sync across multiple indexes.
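
To make the layout above concrete, here is a minimal sketch of what such mappings might look like, written as a Python dict that could be sent as the body of an index-creation request. It assumes the legacy `_parent` mapping syntax; the timeline type names are the ones given in the EDIT further down, while the `user` parent type name and the `screen_name` and `text` fields are placeholders rather than the real mappings.

```python
import json

# Timeline type names as listed in the question. The "user" parent type name and
# the "screen_name" / "text" fields are illustrative assumptions only.
TIMELINE_TYPES = [
    "home_timeline",
    "my_tweets_timeline",
    "mentions_timeline",
    "direct_messages_timeline",
]


def build_mappings(timeline_types, parent_type="user"):
    """Return one mapping per timeline type, each pointing at the same parent type."""
    mappings = {
        parent_type: {"properties": {"screen_name": {"type": "string"}}}
    }
    for name in timeline_types:
        mappings[name] = {
            # Legacy parent/child declaration: every tweet type shares one parent type.
            "_parent": {"type": parent_type},
            "properties": {"text": {"type": "string"}},
        }
    return mappings


if __name__ == "__main__":
    # Print the body that would accompany an index-creation request.
    print(json.dumps({"mappings": build_mappings(TIMELINE_TYPES)}, indent=2))
```

All four tweet types and the single user type end up in the same per-owner index, which is the arrangement the question is asking about.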

I have noticed much slower responses when querying the indexes with millions of documents, even when the query excludes the 'home timeline' type (which holds the millions of documents) and targets only the types with a few thousand entries. I don't want to split the types into separate indexes (unless I have to), due to the parent-child relationship between a tweet and a user.

Is there a way I can understand if the issue is with the total number of documents in a specific index, something to do with the operation of 'has_child' filtered queries, some other poor design of queries or facets, or something else?
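
One way to start separating these possibilities might be to time a plain type-restricted query against a `has_child` query on the same index. The sketch below only builds the request bodies and times an arbitrary callable; it assumes the legacy `filtered` query syntax, and `search` is a hypothetical stand-in for whatever function actually submits a body to the index. The `text` field is also a placeholder.

```python
import time


def type_only_query(doc_type):
    """match_all restricted to a single type via the type filter."""
    return {
        "query": {
            "filtered": {
                "query": {"match_all": {}},
                "filter": {"type": {"value": doc_type}},
            }
        }
    }


def has_child_query(child_type, text):
    """Parent documents that have at least one child of child_type matching text."""
    return {
        "query": {
            "filtered": {
                "query": {"match_all": {}},
                "filter": {
                    "has_child": {
                        "type": child_type,
                        # "text" is an assumed field name, not taken from the real mapping.
                        "query": {"match": {"text": text}},
                    }
                },
            }
        }
    }


def time_search(search, body, runs=5):
    """Call search(body) `runs` times and return the fastest wall-clock time in seconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        search(body)
        timings.append(time.perf_counter() - start)
    return min(timings)
```

If `time_search(search, type_only_query("my_tweets_timeline"))` is already slow on a large index but fast on a small one, that would point at the total index size rather than at `has_child` or the facets.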

Any input would be appreciated.

EDIT

To clarify the statement that tweets are stored per timeline: there is an ElasticSearch type defined for home_timeline, my_tweets_timeline, mentions_timeline, direct_messages_timeline, etc., which correspond to what you see in the standard twitter.com UI. So there is a natural split between the sets of tweets, although with some overlap too.

I have gone back to check the has_child queries, and they are a definite red herring at this point. Basic queries on the larger indexes are much slower, even when querying a type with just a few thousand documents (my_tweets_timeline).
