Reputation: 2437
I'm running a setup with Django 1.4, Haystack 2 beta, and ElasticSearch 0.20. My database is PostgreSQL 9.1, which has several million records. When I try to index all of my data with Haystack/ElasticSearch, the process times out and I get a message that just says "Killed". So far I've tried hardcoding the timeout in haystack/backends/__init__.py, and that seems to have no effect.

If hardcoding the timeout doesn't work, then how else can I extend the time for indexing? Is there another way to change this directly in ElasticSearch? Or perhaps some batch processing method?
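For reference, the settings-level equivalent of that hardcoded change would, I assume, look roughly like this; the TIMEOUT key and its value are a guess on my part rather than something I've confirmed:

# settings.py -- rough sketch; I'm assuming TIMEOUT is the value that
# haystack/backends/__init__.py falls back to when nothing is hardcoded.
HAYSTACK_CONNECTIONS = {
    'default': {
        'ENGINE': 'haystack.backends.elasticsearch_backend.ElasticsearchSearchEngine',
        'URL': 'http://127.0.0.1:9200/',
        'INDEX_NAME': 'haystack',
        'TIMEOUT': 60 * 10,  # assumed to be in seconds
    },
}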
Thanks in advance!
Upvotes: 4
Views: 1685
Reputation: 2437
This version of Haystack is buggy. The problem comes from the following line in haystack/management/commands/update_index.py:
pks_seen = set([smart_str(pk) for pk in qs.values_list('pk', flat=True)])
That line causes the server to run out of memory, since it pulls every primary key in the table into a single set. However, it does not seem to be needed for indexing, so I just changed it to:
pks_seen = set([])
Now it's running through the batches. Thank you to everyone who answered!
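If you do need pks_seen (as far as I can tell it's only consumed when removing stale documents), a lower-memory variant is to stream the primary keys instead of letting Django cache the whole result set. A rough sketch of a drop-in replacement for the original line, under that assumption:

from django.utils.encoding import smart_str

# .iterator() skips Django's queryset result cache; the set still grows
# with the table, but the duplicate in-memory copy of the rows goes away.
pks_seen = set(
    smart_str(pk)
    for pk in qs.values_list('pk', flat=True).iterator()
)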
Upvotes: 2
Reputation: 1163
I'd venture that the issue is with generating the documents to send to ElasticSearch, and that using the --batch-size option will help you out.

The update method in the ElasticSearch backend prepares the documents to index from each provided queryset and then does a single bulk insert for that queryset:
self.conn.bulk_index(self.index_name, 'modelresult', prepped_docs, id_field=ID)
So it looks like if you've got a table with millions of records, running update_index on that indexed model will mean you need to generate those millions of documents and then index them all at once. I would venture this is where the problem is. Setting a batch limit with the --batch-size option should limit the documents generated by slicing the queryset into chunks of your batch size.
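To make that concrete, here is a rough sketch of what the batching amounts to; full_prepare() and the other names mirror Haystack's ElasticSearch backend, but this is an approximation, not the actual implementation:

from haystack.constants import ID

def index_in_batches(backend, index, queryset, batch_size=1000):
    # Slice the queryset so only batch_size objects are prepared and held
    # in memory at a time, instead of the whole table.
    total = queryset.count()
    for start in range(0, total, batch_size):
        batch = queryset[start:start + batch_size]
        prepped_docs = [index.full_prepare(obj) for obj in batch]
        backend.conn.bulk_index(backend.index_name, 'modelresult',
                                prepped_docs, id_field=ID)

Running the management command with something like ./manage.py update_index --batch-size=1000 should give you that behaviour without touching any code.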
Upvotes: 6
Reputation: 15506
Have you watched the memory your process is consuming when you try to index all of those records? Typically when you see "Killed" it means that your system has run out of memory, and the OOM killer has decided to kill your process in order to free up system resources.
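If you want to confirm that, watch the indexer's resident memory while it runs; a minimal sketch using the third-party psutil package (an assumption on my part, it's not part of Django or Haystack):

import sys
import time
import psutil  # third-party: pip install psutil

def watch_memory(pid, interval=5):
    # Print the process's resident set size until it exits (or is killed).
    proc = psutil.Process(pid)
    while proc.is_running():
        rss_mb = proc.memory_info().rss / (1024.0 * 1024.0)
        print("pid %d RSS: %.1f MB" % (pid, rss_mb))
        time.sleep(interval)

if __name__ == '__main__':
    watch_memory(int(sys.argv[1]))

The kernel log (dmesg) will usually also contain an OOM-killer entry naming the process it killed.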
Upvotes: 1