Sam

Reputation: 20486

Give reads priority over writes in Elasticsearch

I have an EC2 server running Elasticsearch 0.9 with an nginx server in front for read/write access. My index has about 750k small-to-medium documents. I have a fairly continuous stream of minimal writes (mainly updates) to the content. The speed and consistency I get from search are fine, but I have sporadic timeout issues with multi-get (/_mget).

On some pages in my app, our server will request a multi-get of a dozen to a few thousand documents (this usually takes less than 1-2 seconds). The requests that fail, fail with a 30,000-millisecond timeout from the nginx server. I'm assuming this happens because the index is temporarily locked for writing/optimizing. Does anyone have any ideas on what I can do here?
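For reference, the requests look roughly like this (the index name, type, and IDs here are made up for illustration; the real requests carry up to a few thousand IDs):

POST /_mget HTTP/1.1
Host: 127.0.0.1:9200

{
    "docs" : [
        { "_index" : "my_index", "_type" : "doc", "_id" : "1" },
        { "_index" : "my_index", "_type" : "doc", "_id" : "2" }
    ]
}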

A temporary solution would be to lower the timeout and return a user-friendly message saying the documents couldn't be retrieved (though users would still have to wait ~10 seconds before seeing the error).
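If I went that route, something like this in the nginx config should do it (location path and values are illustrative, not our actual config):

# Cap how long nginx waits on Elasticsearch before failing the request
location / {
    proxy_pass http://127.0.0.1:9200;
    proxy_read_timeout 10s;
}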

Another thought was to give reads priority over writes: any time someone is reading part of the index, don't allow any writes/locks to that section. I don't think this would scale, and it may not even be possible.

Finally, I was thinking I could set up a read-only alias and a write-only alias. The documentation shows how to set this up, but I'm not sure it will actually work the way I expect (and I'm not sure how to reliably test it in a local environment). If I set up aliases like this, would the read-only alias still have moments where the index was locked because of writes coming through the write-only alias?
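For what it's worth, the alias setup itself looks straightforward via the _aliases endpoint (index and alias names below are placeholders). My understanding is that an alias is only an alternate name for the same underlying index and shards, so I suspect merges triggered by writes through one alias would still affect reads through the other:

POST /_aliases HTTP/1.1
Host: 127.0.0.1:9200

{
    "actions" : [
        { "add" : { "index" : "my_index", "alias" : "my_index_read" } },
        { "add" : { "index" : "my_index", "alias" : "my_index_write" } }
    ]
}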

I'm sure someone else has run into this before; what is the typical solution to make sure a user can always read data from the index, with a higher priority than writes? I would consider increasing our server capacity if required. Currently we have two m2.xlarge EC2 instances: one primary and one replica, each with 4 shards.

An example dump of cURL info from a failed request (with an error of Operation timed out after 30000 milliseconds with 0 bytes received):

{
   "url":"127.0.0.1:9200\/_mget",
   "content_type":null,
   "http_code":100,
   "header_size":25,
   "request_size":221,
   "filetime":-1,
   "ssl_verify_result":0,
   "redirect_count":0,
   "total_time":30.391506,
   "namelookup_time":7.5e-5,
   "connect_time":0.0593,
   "pretransfer_time":0.059303,
   "size_upload":167002,
   "size_download":0,
   "speed_download":0,
   "speed_upload":5495,
   "download_content_length":-1,
   "upload_content_length":167002,
   "starttransfer_time":0.119166,
   "redirect_time":0,
   "certinfo":[

   ],
   "primary_ip":"127.0.0.1",
   "redirect_url":""
}

Upvotes: 2

Views: 770

Answers (1)

Sam

Reputation: 20486

After more monitoring with the Paramedic plugin, I noticed that I would get timeouts when CPU hit ~80-98% (with no obvious spikes in indexing/search traffic). I finally stumbled across a helpful thread on the Elasticsearch forum. It seems this happens when the index is refreshing and large segment merges are occurring.

Merges can be throttled at the cluster or index level, and I've lowered indices.store.throttle.max_bytes_per_sec from the default of 20mb to 5mb. This can be done at runtime with the cluster update settings API.

PUT /_cluster/settings HTTP/1.1
Host: 127.0.0.1:9200

{
    "persistent" : {
        "indices.store.throttle.max_bytes_per_sec" : "5mb"
    }
}
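To confirm the setting took effect, the cluster settings can be read back (assuming the same host):

GET /_cluster/settings HTTP/1.1
Host: 127.0.0.1:9200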

So far Paramedic is showing a decrease in CPU usage, from an average of ~5-25% down to ~1-5%. Hopefully this helps me avoid the 90%+ spikes that were locking up my queries; I'll report back and accept this answer if I don't have any more problems.

As a side note, I probably could have opted for more balanced EC2 instances (rather than memory-optimized). I'm happy with my current choice, but my next purchase will take CPU into account as well.

Upvotes: 2
