Reputation: 917
I've read the docs on bulk upload. I generated a file with one JSON document per line that's about 160 GB. For the bulk upload it looks like I'd have to put some kind of schema info before each line inserted, which would blow up the size of the file.
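As far as I can tell, the bulk format wants an action line in front of every document, something like this (the index/type names are just the placeholders from my URL, and the document fields are made up):

    {"index":{"_index":"xxx","_type":"yyy"}}
    {"field1":"value1","field2":"value2"}
    {"index":{"_index":"xxx","_type":"yyy"}}
    {"field1":"value3","field2":"value4"}

with the whole chunk POSTed to http://localhost:9200/_bulk.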
To upload my one-JSON-per-line file, I use GNU parallel to POST each line with curl:
    cat out.json | parallel -j 32 --pipe -N1 curl -XPOST 'http://localhost:9200/xxx/yyy' --data-binary @-
This is really slow, though. I could also run the job on a machine that has an SSD, take a snapshot, and then load it onto the server that doesn't have one. What techniques do you use for the fastest bulk upload?
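(By "take a snapshot, then load" I mean roughly this, assuming a filesystem repository at a path both machines can reach; the repository and snapshot names are made up:)

    # on the SSD machine: register a filesystem repository and snapshot the index
    curl -XPUT 'http://localhost:9200/_snapshot/my_backup' -d '{"type":"fs","settings":{"location":"/mnt/backups/my_backup"}}'
    curl -XPUT 'http://localhost:9200/_snapshot/my_backup/snapshot_1?wait_for_completion=true'
    # on the target server: register the same repository, then restore
    curl -XPOST 'http://localhost:9200/_snapshot/my_backup/snapshot_1/_restore'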
Upvotes: 1
Views: 275
Reputation: 4803
I think you can take a look at Logstash with a JSON file input (the file input plugin with a json codec). Logstash's own docs note that it is only as fast as the services it talks to, and reading a 160 GB JSON file is a heavy filesystem operation (the disk I/O cost will be high).
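A minimal pipeline sketch, assuming the file sits at /data/out.json and reusing the index/type names from your URL (the path and hosts are placeholders):

    input {
      file {
        path           => "/data/out.json"
        start_position => "beginning"
        sincedb_path   => "/dev/null"
        codec          => "json"
      }
    }
    output {
      elasticsearch {
        hosts         => ["localhost:9200"]
        index         => "xxx"
        document_type => "yyy"
      }
    }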
If your cluster suffers from poor write performance, consider adding a buffer queue to hold the data. If your data feed exceeds the Elasticsearch cluster's ability to ingest it, you can use a message queue as a buffer. By default, Logstash throttles incoming events when indexer consumption rates fall below incoming data rates. Since this throttling can lead to events being buffered at the data source, preventing backpressure with message queues becomes an important part of managing your deployment.
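One sketch of that pattern: a shipper pushes into a broker (Redis here, but Kafka works the same way) and a separate indexer drains it into Elasticsearch at whatever rate the cluster can absorb; the host name and key are placeholders:

    # shipper.conf: file -> redis
    input  { file { path => "/data/out.json" start_position => "beginning" codec => "json" } }
    output { redis { host => "broker-host" data_type => "list" key => "bulk_load" } }

    # indexer.conf: redis -> elasticsearch
    input  { redis { host => "broker-host" data_type => "list" key => "bulk_load" } }
    output { elasticsearch { hosts => ["localhost:9200"] index => "xxx" document_type => "yyy" } }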
Upvotes: 1