Reputation: 61
In a Rails 3.2.x app (Ruby 1.9.3), a rake task uses (Re)tire to access an ES cluster and goes through approx. 1M rows to create a new index.
The task uses .to_json with specific attributes and methods listed, to limit the resulting hash for each element. Yet as the task runs, memory is eaten away, and the process usually ends up being killed by the system.
The task is already using find_in_batches. Smaller batch sizes (using find_each) don't help.
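For illustration, a minimal sketch of the setup described above; the model, attribute, and index names are hypothetical, not the actual code:

```ruby
# Hypothetical sketch of the current setup; Article, :author_name and
# the 'articles-v2' index name are assumptions.
class Article < ActiveRecord::Base
  include Tire::Model::Search

  def to_indexed_json
    to_json(only: [:id, :title], methods: [:author_name])
  end
end

index = Tire.index('articles-v2')
Article.find_in_batches(batch_size: 1000) do |batch|
  index.import batch  # each record is serialized through to_indexed_json
end
```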
Removing the index.import call does improve things (obviously): the task goes through the whole collection very fast without a problem. That points to either ES, Tire, or the JSON conversion (and the relations it might call upon).
Adding back index.import and passing a very limited hash (with string keys) for each item makes things slower, but not by much, and does not eat memory away. So JSON might not be the culprit here.
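Roughly, that limited-hash test looked like this (a sketch; the keys are assumptions, and it assumes Tire's import accepts plain hashes carrying an id):

```ruby
# Sketch of the limited-hash test; key names are assumptions.
Article.find_in_batches(batch_size: 1000) do |batch|
  docs = batch.map do |a|
    { 'id' => a.id, 'type' => 'article', 'title' => a.title }
  end
  index.import docs  # plain hashes, no model serialization involved
end
```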
The culprit seems to be one of the methods used to grab one of the additional attributes. It's based on a relation of the model and another ... ending up with a lot of models being involved and sifted through.
As pointed out in Index the results of a method in ElasticSearch (Tire + ActiveRecord), adding includes does help a bit, but the task still ends up heavy.
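That is, eager-loading the associations the indexed methods touch (the association name here is a placeholder):

```ruby
# Eager-load the associations used by the indexed methods, to avoid
# one extra query per record; :author is a hypothetical association.
Article.includes(:author).find_in_batches(batch_size: 1000) do |batch|
  index.import batch
end
```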
I also tried to work around part of the problem by replacing the calls to Tire with the ES bulk API. Generating the JSON payloads and sending them with a Ruby HTTP lib can work. Yet the same memory problem arises, since the same requests to the DB are made.
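For reference, a hedged sketch of that bulk-API variant using Net::HTTP from the standard library; the host, index/type names, and document shape are assumptions:

```ruby
require 'net/http'
require 'json'

# Sketch of the bulk-API alternative; host and index names are assumptions.
http = Net::HTTP.new('localhost', 9200)

Article.find_in_batches(batch_size: 1000) do |batch|
  # The bulk API expects newline-delimited JSON: an action line
  # followed by the document source line, for each record.
  body = batch.map do |a|
    [{ index: { _index: 'articles-v2', _type: 'article', _id: a.id } }.to_json,
     { title: a.title }.to_json]
  end.flatten.join("\n") + "\n"

  http.post('/_bulk', body, 'Content-Type' => 'application/json')
end
```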
What I don't get is why, even with find_in_batches, Ruby keeps eating away memory. I would expect that after each batch of data, the memory related to that batch would be freed.
Next to try: GC.start calls, and deactivating ActiveRecord query caching around the task.
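Something along these lines (a sketch; ActiveRecord::Base.uncached disables the query cache for the duration of the block):

```ruby
# Disable the AR query cache for the whole run and force a GC pass
# after each batch; names reused from the sketches above.
ActiveRecord::Base.uncached do
  Article.find_in_batches(batch_size: 1000) do |batch|
    index.import batch
    GC.start  # ask Ruby for a full collection between batches
  end
end
```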
Yet, unless a solution limits the memory use drastically (300 or 500 MB instead of 800+), the underlying issue remains: indexing a lot of instances of a model, including data related to some other models.
Upvotes: 2
Views: 298
Reputation: 3035
I've been using Rails + Elasticsearch for a while and did this kind of dance a few times. A few things come to mind, in no particular order.
Did you try the more recent elasticsearch gem (instead of tire)? I've updated my apps to use it and like having more control over what is done.
I would also try to force a GC sweep after each ActiveRecord loop. You could also be extra careful with memory allocation by explicitly resetting all local variables each time.
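For instance, a sketch of that suggestion:

```ruby
# Drop references held by locals before forcing a GC pass, so the
# collector can actually reclaim the batch's objects.
Article.find_in_batches(batch_size: 500) do |batch|
  index.import batch
  batch = nil  # release the reference held by the block variable
  GC.start     # then force a collection
end
```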
You could use the fork & exec trick to fork a brand new process at each loop; it would be the most effective GC you can get. There's a little overhead when you write it the first time, but the pay-off is great. Take good care to limit the amount of memory used in the outer part of the task. Using a process-based background task would partly achieve the same goal, but you might still get memory bloat.
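A hedged sketch of the fork approach: only ids are loaded in the parent, the child re-establishes its own DB connection and does the heavy lifting, and all the memory it allocates is returned to the OS when it exits. Model and association names are illustrative.

```ruby
# Each batch is processed in a short-lived child process, so memory
# bloat cannot accumulate across batches.
Article.select(:id).find_in_batches(batch_size: 1000) do |batch|
  pid = fork do
    # A forked child must not reuse the parent's DB connection.
    ActiveRecord::Base.establish_connection
    records = Article.includes(:author).where(id: batch.map(&:id))
    Tire.index('articles-v2').import records
  end
  Process.wait(pid)
end
```

Depending on the adapter, you may also want to disconnect the parent's connection before forking and reconnect afterwards, so parent and child never share a socket.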
Can you limit the use of ActiveRecord? If you only need some basic associations, you could use a lower-level/simpler tool like Sequel (or similar) and work with Ruby hashes/arrays instead of full-fledged AR models.
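For example, with Sequel you get plain hashes back (a sketch; the connection URL and column names are assumptions):

```ruby
require 'sequel'

# Fetch rows as plain hashes instead of building AR objects;
# the connection URL and columns are placeholders.
DB = Sequel.connect('postgres://localhost/myapp_production')

DB[:articles].select(:id, :title).paged_each(rows_per_fetch: 1000) do |row|
  # row is a plain Hash, e.g. {:id => 1, :title => "..."}
end
```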
Upvotes: 1