How to improve parallel_bulk from python code for elastic insert?

Question

I have some documents (about 300 bytes each) that I'd like to insert into my ES index using the Python library. There is a huge time difference between the Python code and curl. I expect curl to be faster, but I'd like to know whether the Python time can be improved to narrow the ratio.

The curl option takes about 20 s to insert the data, plus about 10 s to print the ES result (the data is already inserted after the first 20 s):

curl -H "Content-Type: application/json" -XPOST \
        "localhost:9200/contentindex/doc/_bulk?" --data-binary @superfile.bulk.json

With the Python option, the best time I reached was 1 min 20 s, using the settings 10000/16/16 (chunk size/thread count/queue size):

import codecs
from collections import deque
from elasticsearch import Elasticsearch
from elasticsearch.helpers import parallel_bulk

es = Elasticsearch()

def insert_data(filename, indexname):
    with codecs.open(filename, "r", encoding="utf-8", errors="ignore") as fic:
        for line in fic:        
            json_line = {}
            json_line["data1"] = "random_foo_bar1"
            json_line["data2"] = "random_foo_bar2"
            # more fields ...        
            yield {
                "_index": indexname,
                "_type": "doc",
                "_source": json_line
            }

if __name__ == '__main__':
    pb = parallel_bulk(es, insert_data("superfile.bulk.json", "contentindex"),
                       chunk_size=10000, thread_count=16, queue_size=16)
    # parallel_bulk is lazy: consume the generator so the requests are sent
    deque(pb, maxlen=0)

Facts

I have a machine with two 8-core Xeon processors and 64 GB of RAM.
I tried multiple values for each setting: [100-50000]/[2-24]/[2-24] (chunk size/thread count/queue size).

Questions

Can I still improve the time?
If not, should I write the data to a file and then spawn a process that runs the curl command?
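
For the fallback raised in the second question (write the data to a file, then hand it to curl), a minimal sketch could look like the block below. The helper names `write_bulk_file`, `curl_bulk_command` and `post_bulk_file` are hypothetical; the URL and the Content-Type header are taken from the curl command above, and the empty `{"index": {}}` action line assumes the index and type come from the URL, as they do there.

```python
import json
import subprocess


def write_bulk_file(docs, path):
    """Write docs as newline-delimited JSON: one action line, then one source line."""
    count = 0
    with open(path, "w", encoding="utf-8") as out:
        for doc in docs:
            # Empty action metadata: index and type are given in the request URL.
            out.write(json.dumps({"index": {}}) + "\n")
            out.write(json.dumps(doc, ensure_ascii=False) + "\n")
            count += 1
    return count


def curl_bulk_command(path, url="localhost:9200/contentindex/doc/_bulk"):
    """Build the curl call from the question as an argument list (no shell quoting)."""
    return [
        "curl", "-H", "Content-Type: application/json",
        "-XPOST", url, "--data-binary", "@" + path,
    ]


def post_bulk_file(path):
    """Run curl as a child process; needs curl on PATH and a reachable ES node."""
    return subprocess.run(curl_bulk_command(path),
                          capture_output=True, text=True, check=True)
```

Whether this beats `parallel_bulk` overall depends on how long the file-writing step takes, so it would need to be timed together with the curl call rather than on its own.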

If I time only the parsing part, it takes 15 s:

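import time

tm = time.time()
array = []

# Step 1: run only the parsing generator and keep the actions in memory
pb = insert_data("superfile.bulk.json", "contentindex")
for p in pb:
    array.append(p)
print(time.time() - tm)            # 15

# Step 2: index the pre-parsed actions (the timer is still running from the start)
pb = parallel_bulk(es, array, chunk_size=10000, thread_count=16, queue_size=16)
deque(pb, maxlen=0)
print(time.time() - tm)            # 90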

ozlevka · Accepted Answer

After my testing:

curl is faster than the Python client; evidently curl is better implemented.
After more testing and playing with the parameters, I can conclude:
1. Elasticsearch indexing performance depends on the configuration of the index and of the entire cluster. You can gain performance by mapping the fields in the index correctly.
2. My best result was with 8 threads and chunks of 10000 items. This depends on the index.index_concurrency setting, which is 8 by default.
3. I think that using a multi-node cluster with a separate master node should improve performance.
4. For more information, you can read a great two-part article I found: here and here
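
To compare settings such as the 8 threads and 10000-item chunks from point 2, a small timing wrapper can help. This is only a sketch: `run_parallel_bulk` and its `bulk_fn` parameter are hypothetical names introduced here so the helper can be tried with a stand-in function, and `queue_size=8` is an assumed value, not one from the answer. It relies on `parallel_bulk` yielding `(ok, info)` tuples.

```python
import time


def run_parallel_bulk(client, actions, bulk_fn=None,
                      chunk_size=10000, thread_count=8, queue_size=8):
    """Drain a parallel_bulk-style generator; return (ok_count, failed, seconds)."""
    if bulk_fn is None:
        # Imported lazily so the wrapper can be tried without the library installed.
        from elasticsearch.helpers import parallel_bulk as bulk_fn
    ok_count, failed = 0, []
    start = time.perf_counter()
    # The generator is lazy, so iterating it is what actually sends the requests.
    for ok, info in bulk_fn(client, actions, chunk_size=chunk_size,
                            thread_count=thread_count, queue_size=queue_size):
        if ok:
            ok_count += 1
        else:
            failed.append(info)
    return ok_count, failed, time.perf_counter() - start
```

Calling it as `run_parallel_bulk(es, insert_data("superfile.bulk.json", "contentindex"))` with different `thread_count` and `chunk_size` values gives comparable timings for each combination.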
