azro

Reputation: 54148

How to improve parallel_bulk from python code for elastic insert?

I have some documents (about 300 bytes each) that I'd like to insert into my ES index using the Python lib. There is a huge time difference between the Python code and curl; I understand some overhead is normal, but I'd like to know whether the ratio can be improved.

  1. The curl option takes about 20 sec to insert, plus ~10 sec for printing the ES result (the data itself is inserted after 20 sec)

    curl -H "Content-Type: application/json" -XPOST 
            "localhost:9200/contentindex/doc/_bulk?" --data-binary @superfile.bulk.json 
    
  2. With the Python option, the best I reached is 1 min 20 s, using chunk_size=10000, thread_count=16, queue_size=16

    import codecs
    from collections import deque
    from elasticsearch import Elasticsearch
    from elasticsearch.helpers import parallel_bulk
    
    es = Elasticsearch()
    
    def insert_data(filename, indexname):
        with codecs.open(filename, "r", encoding="utf-8", errors="ignore") as fic:
            for line in fic:        
                json_line = {}
                json_line["data1"] = "random_foo_bar1"
                json_line["data2"] = "random_foo_bar2"
                # more fields ...        
                yield {
                    "_index": indexname,
                    "_type": "doc",
                    "_source": json_line
                }
    
    if __name__ == '__main__':
        pb = parallel_bulk(es, insert_data("superfile.bulk.json", "contentindex"),
                           chunk_size=10000, thread_count=16, queue_size=16)
        deque(pb, maxlen=0)
    
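
As a side note, `parallel_bulk` yields a `(success, info)` tuple per document, so instead of silently discarding everything with `deque`, the results can be consumed to count failures. A minimal sketch (the helper name and the stubbed results list are my own, for illustration):

```python
def summarize_bulk(results):
    """Consume (success, info) tuples, as yielded by parallel_bulk,
    and report how many documents were indexed vs. failed."""
    ok = 0
    failed = []
    for success, info in results:
        if success:
            ok += 1
        else:
            failed.append(info)
    return ok, failed

# Usage against a real cluster (requires elasticsearch-py and a server):
#   pb = parallel_bulk(es, insert_data("superfile.bulk.json", "contentindex"),
#                      chunk_size=10000, thread_count=16, queue_size=16)
#   ok, failed = summarize_bulk(pb)

# Stubbed demonstration without a cluster:
fake_results = [(True, {"index": {"_id": "1"}}),
                (False, {"index": {"_id": "2", "error": "mapping"}})]
print(summarize_bulk(fake_results))
```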



If I time only the parsing part, it takes 15 sec:

import time
from collections import deque

tm = time.time()
array = []

pb = insert_data("superfile.bulk.json", "contentindex")
for p in pb:
    array.append(p)
print(time.time() - tm)            # 15

pb = parallel_bulk(es, array, chunk_size=10000, thread_count=16, queue_size=16)
deque(pb, maxlen=0)
print(time.time() - tm)            # 90
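
Since the file is already in bulk (NDJSON) format, one way to approach curl's speed is to skip per-document serialization entirely and send raw chunks of the file, the way curl does. A sketch, assuming `es.bulk(body=...)` accepts a pre-built NDJSON string (the chunking helper is my own):

```python
import codecs

def ndjson_chunks(path, pairs_per_chunk=5000):
    """Yield raw NDJSON chunks from a bulk file. Each document takes two
    lines (action + source), so a chunk holds pairs_per_chunk * 2 lines."""
    buf = []
    with codecs.open(path, "r", encoding="utf-8", errors="ignore") as f:
        for line in f:
            buf.append(line)
            if len(buf) >= pairs_per_chunk * 2:
                yield "".join(buf)
                buf = []
    if buf:
        yield "".join(buf)

# Usage (requires a running cluster):
#   for chunk in ndjson_chunks("superfile.bulk.json"):
#       es.bulk(body=chunk)
```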

Upvotes: 7

Views: 10137

Answers (1)

ozlevka

Reputation: 2146

After my testing:

  1. curl works faster than the Python client; apparently curl is implemented more efficiently.

  2. After more testing and playing with the parameters, I can conclude:

    1. Elasticsearch indexing performance depends on the configuration of the index and of the whole cluster. You can get more performance by mapping the fields of the index correctly.
    2. My best result was with 8 threads and a chunk size of 10000 items. This depends on the index.index_concurrency setting, which is 8 by default.

    3. I think that using a multi-node cluster with a separate master node should improve performance.

    4. For more information, you can read a great two-part article I found: here and here
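
On point 1, an explicit mapping avoids dynamic-mapping work at index time. A minimal sketch of creating the index with `keyword` fields before bulk loading (the field names are taken from the question; whether `keyword` fits depends on how the fields are queried):

```python
# Explicit mapping for the example fields: keyword fields skip text
# analysis, which reduces per-document indexing work. Adjust the types
# to how you actually query the data.
mapping = {
    "mappings": {
        "doc": {
            "properties": {
                "data1": {"type": "keyword"},
                "data2": {"type": "keyword"},
            }
        }
    }
}

# Usage (requires a running cluster and elasticsearch-py):
#   es.indices.create(index="contentindex", body=mapping)
```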

Upvotes: 9

Related Questions