Reputation: 54148
I have some documents (about 300 bytes per doc) that I'd like to insert into my ES index using the Python library. I see a huge time difference between the Python code and curl.
I understand that some difference is normal, but I'd like to know whether the Python time can be improved (relative to the curl time).
The curl option takes about 20 sec to insert; the remaining time (about 10 sec) is spent printing the ES result, but the data is already inserted after 20 sec.
    curl -H "Content-Type: application/json" -XPOST "localhost:9200/contentindex/doc/_bulk?" --data-binary @superfile.bulk.json
With the Python option, the best I reached was 1 min 20 sec, using the setting 10000/16/16 (chunk/thread/queue).
    import codecs
    from collections import deque
    from elasticsearch import Elasticsearch
    from elasticsearch.helpers import parallel_bulk

    es = Elasticsearch()

    def insert_data(filename, indexname):
        with codecs.open(filename, "r", encoding="utf-8", errors="ignore") as fic:
            for line in fic:
                json_line = {}
                json_line["data1"] = "random_foo_bar1"
                json_line["data2"] = "random_foo_bar2"
                # more fields ...
                yield {
                    "_index": indexname,
                    "_type": "doc",
                    "_source": json_line
                }

    if __name__ == '__main__':
        pb = parallel_bulk(es, insert_data("superfile.bulk.json", "contentindex"),
                           chunk_size=10000, thread_count=16, queue_size=16)
        deque(pb, maxlen=0)
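A note on the last line of the script: parallel_bulk returns a lazy generator, so nothing is sent until it is iterated, and deque(..., maxlen=0) is a common way to exhaust an iterator without keeping its items. Here is a minimal sketch of that idiom; lazy_results is a hypothetical stand-in for parallel_bulk, not part of the elasticsearch library.

```python
from collections import deque


def lazy_results(n, log):
    """Hypothetical stand-in for parallel_bulk: yields one result per item, lazily."""
    for i in range(n):
        log.append(i)  # side effect happens only when the item is consumed
        yield (True, i)


log = []
gen = lazy_results(5, log)
assert log == []  # creating the generator has not run anything yet

drained = deque(gen, maxlen=0)  # iterate to exhaustion, store nothing
print(len(log), len(drained))   # all 5 items were processed, none were kept
```

The same reasoning is why the script above does no indexing at all if the deque line is removed.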
Facts
Settings tested (chunk/thread/queue): [100-50000]/[2-24]/[2-24]
Questions
Can I still improve the time?
If not, should I think of a way to write the data to a file and then spawn a process that runs the curl command?
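To make the second question concrete, here is a rough sketch of what "write a file, then hand it to curl" could look like. The helper names (write_bulk_file, build_curl_command, post_with_curl) are made up for illustration, and the sketch assumes that curl is on the PATH and that the index and type are given in the URL, so each action line can be an empty {"index": {}}.

```python
import json
import subprocess


def write_bulk_file(docs, path):
    """Write docs in the newline-delimited bulk format: action line, then source line."""
    count = 0
    with open(path, "w", encoding="utf-8") as out:
        for doc in docs:
            # Index and type come from the URL, so the action metadata stays empty.
            out.write(json.dumps({"index": {}}) + "\n")
            out.write(json.dumps(doc) + "\n")
            count += 1
    return count


def build_curl_command(path, url):
    """Build the same curl invocation as in the question, as an argument list."""
    return [
        "curl", "-s",
        "-H", "Content-Type: application/json",
        "-XPOST", url,
        "--data-binary", "@" + path,
    ]


def post_with_curl(path, url):
    """Run curl as a child process; needs curl installed and a reachable cluster."""
    return subprocess.run(
        build_curl_command(path, url),
        capture_output=True, text=True, check=True,
    )
```

Usage would be something like write_bulk_file(docs, "out.bulk.json") followed by post_with_curl("out.bulk.json", "localhost:9200/contentindex/doc/_bulk"). Whether this beats parallel_bulk overall depends on how long writing the file takes, so it would need to be measured.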
If I time only the parsing part, it takes 15 sec:
    import time

    tm = time.time()
    # Materialize every action first so parsing is timed separately from indexing.
    array = []
    pb = insert_data("superfile.bulk.json", "contentindex")
    for p in pb:
        array.append(p)
    print(time.time() - tm)  # 15

    pb = parallel_bulk(es, array, chunk_size=10000, thread_count=16, queue_size=16)
    deque(pb, maxlen=0)  # drain the generator so the bulk requests are actually sent
    print(time.time() - tm)  # 90
Upvotes: 7
Views: 10137
Reputation: 2146
After my testing:
curl works much faster than the Python client; apparently curl is simply better implemented for this job.
After more testing and playing with the parameters, I can conclude:
My best result was with 8 threads and chunks of 10000 items. This depends on the index.index_concurrency setting, which is 8 by default.
I think that using a multi-node cluster with a separate master node should improve performance.
For more information, you can read a great two-part article I found: here and here
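If you want to repeat this kind of parameter tuning on your own data, a small timing harness helps. This is only a sketch: time_settings and run_once are hypothetical names, and run_once is assumed to be a function you write yourself that performs one full import with the given settings (for example by draining parallel_bulk into a deque against a freshly created index).

```python
import itertools
import time


def time_settings(run_once, chunk_sizes, thread_counts):
    """Time run_once(chunk_size, thread_count) for every combination.

    Returns a list of (elapsed_seconds, chunk_size, thread_count), fastest first.
    """
    timings = []
    for chunk_size, thread_count in itertools.product(chunk_sizes, thread_counts):
        start = time.perf_counter()
        run_once(chunk_size, thread_count)
        timings.append((time.perf_counter() - start, chunk_size, thread_count))
    return sorted(timings)
```

Calling time_settings(run_once, [1000, 5000, 10000], [4, 8, 16]) and looking at the first entry gives the fastest combination for that run. Results vary between runs, so repeat each measurement a few times before trusting it.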
Upvotes: 9