Reputation: 31
I ran a benchmark on Elasticsearch using elasticsearch-php, comparing the time taken to index 10,000 documents one by one against indexing 10,000 documents in bulk requests of 1,000.
On my VPN server (3 cores, 2 GB of memory), performance is about the same with or without bulk indexing.
My PHP code (inspired by a post):
<?php
set_time_limit(0); // no timeout
require 'vendor/autoload.php';
$es = new Elasticsearch\Client([
    'hosts'=>['127.0.0.1:9200']
]);
$max = 10000;
// ELASTICSEARCH BULK INDEX
$temps_debut = microtime(true);
for ($i = 0; $i <= $max; $i++) {
    $params['body'][] = array(
        'index' => array(
            '_index' => 'articles',
            '_type' => 'article',
            '_id' => 'cle' . $i
        )
    );
    $params['body'][] = array(
        'my_field' => 'my_value' . $i
    );
    if ($i % 1000) { // Every 1000 documents stop and send the bulk request
        $responses = $es->bulk($params);
        $params = array(); // erase the old bulk request
        unset($responses); // unset to save memory
    }
}
$temps_fin = microtime(true);
echo 'Elasticsearch bulk: ' . round($i / round($temps_fin - $temps_debut, 4)) . ' per sec <br>';
// ELASTICSEARCH WITHOUT BULK INDEX
$temps_debut = microtime(true);
for ($i = 1; $i <= $max; $i++) {
    $params = array();
    $params['index'] = 'my_index';
    $params['type'] = 'my_type';
    $params['id'] = "key".$i;
    $params['body'] = array('testField' => 'valeur'.$i);
    $ret = $es->index($params);
}
$temps_fin = microtime(true);
echo 'Elasticsearch One by one : ' . round($i / round($temps_fin - $temps_debut, 4)) . 'per sec <br>';
?>
Output:
Elasticsearch bulk: 1209 per sec
Elasticsearch One by one : 1197per sec
Is there something wrong with my bulk indexing code that keeps it from performing better?
Thanks
Upvotes: 3
Views: 9534
Reputation: 11597
Replace:
if ($i % 1000) { // Every 1000 documents stop and send the bulk request
with:
if (($i + 1) % 1000 === 0) { // Every 1000 documents stop and send the bulk request
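Otherwise the condition is true for every non-zero remainder, so you send a bulk request on 999 out of every 1000 iterations.
Obviously, this only works as-is if $max is a multiple of 1000. If it is not, the documents left over after the last full batch must be sent with one more ->bulk() call after the loop.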
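To make the difference concrete, here is a small sketch that only counts how often each condition fires for $i from 0 to 9999. It does not talk to Elasticsearch at all; the counter names $originalFlushes and $fixedFlushes are mine, chosen for illustration.

```php
<?php
$max = 10000;
$originalFlushes = 0; // how often the original condition would send a request
$fixedFlushes = 0;    // how often the corrected condition would send a request

for ($i = 0; $i < $max; $i++) {
    // Original test: a non-zero remainder is truthy, so this fires
    // on every iteration except the multiples of 1000.
    if ($i % 1000) {
        $originalFlushes++;
    }
    // Corrected test: fires only when a full batch of 1000 has been queued.
    if (($i + 1) % 1000 === 0) {
        $fixedFlushes++;
    }
}

echo "original condition: $originalFlushes requests\n"; // 9990
echo "fixed condition: $fixedFlushes requests\n";       // 10
```

So the "bulk" run in the question sends almost as many requests as the one-by-one run, which would explain the nearly identical throughput.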
Also, correct this bug:
for ($i = 0; $i <= $max; $i++) {
will iterate over $max + 1 items. Replace it with:
for ($i = 0; $i < $max; $i++) {
There might also be a problem with how you initialize $params. Shouldn't you set it up outside of the loop and only clean up $params['body'] after each ->bulk()? When you reset with $params = array(); you lose everything else stored in it.
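Putting these points together, a minimal sketch of the batching loop could look like the following. The function name indexInBatches and the $send callback are hypothetical and not part of elasticsearch-php; $send stands in for a call such as $es->bulk($params), so the sketch runs without a server. The trailing flush handles a $max that is not a multiple of the batch size.

```php
<?php
// Hypothetical helper: queues documents and hands each full batch to $send,
// which stands in for a call such as $es->bulk($params).
function indexInBatches(int $max, int $batchSize, callable $send): int
{
    $params = ['body' => []]; // set up once, outside the loop
    $requests = 0;

    for ($i = 0; $i < $max; $i++) {
        // Action line followed by the document source, as in the question.
        $params['body'][] = [
            'index' => ['_index' => 'articles', '_type' => 'article', '_id' => 'cle' . $i],
        ];
        $params['body'][] = ['my_field' => 'my_value' . $i];

        if (($i + 1) % $batchSize === 0) {
            $send($params);
            $params['body'] = []; // clear only the queued documents
            $requests++;
        }
    }

    // Send whatever is left when $max is not a multiple of $batchSize.
    if (!empty($params['body'])) {
        $send($params);
        $requests++;
    }

    return $requests;
}

// Usage with a stand-in sender that records how many documents each request holds.
$sizes = [];
$requests = indexInBatches(2500, 1000, function (array $params) use (&$sizes) {
    $sizes[] = intdiv(count($params['body']), 2); // two body entries per document
});

echo $requests . " requests, documents per request: " . implode(', ', $sizes) . "\n";
// prints "3 requests, documents per request: 1000, 1000, 500"
```

In the real benchmark you would pass something like function ($params) use ($es) { $es->bulk($params); } as the sender.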
Also, remember that ES may be distributed over a cluster. Bulk operations can then be spread across nodes to even out the workload, so some of the performance gain from bulk indexing will not be visible on a single physical node.
Upvotes: 4