Nathan W

Reputation: 516

Loading Elasticsearch via Logstash on large dataset runs very slowly

I have a large dataset in MySQL (around 2.2 million rows), and my import to Elasticsearch via Logstash works, but it has become incredibly slow.

On my local machine, in Vagrant instances with 4 GB RAM each, it went relatively quickly (it took 3 days), compared to an estimated 80+ days for the server-to-server transfer.

The query is quite complex (it uses a subquery, etc.).

I switched the MySQL server from using the /tmp directory to using /data/tmp_mysql for its temporary files, but even then I was occasionally running out of temporary space.
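
For reference, the relevant MySQL setting is tmpdir; a minimal sketch of the change (the config file path is an assumption, since it varies by distro, and mysqld needs a restart afterwards):

    # /etc/mysql/my.cnf -- assumed location; point temporary files at the larger /data partition
    [mysqld]
    tmpdir = /data/tmp_mysql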

For example, I was getting this error:

message=>"Exception when executing JDBC query, exception Sequel::DatabaseError: Java::JavaSql::SQLException Error writing file '/data/tmp_mysql/MYHPf8X5' (Errcode: 28)

I updated my query to apply this LIMIT (200), so rows are processed in smaller batches:

    UPDATE p_results SET computed_at="0000-00-00 00:00:00" WHERE computed_at IS NULL LIMIT 200;

My configuration file looks like this (notice that I'm using paging with a page size of 10000):

    input {
        jdbc {
            jdbc_connection_string => "jdbc:mysql://xxx.xxx.xxx.xxx:3306/xxx_production"
            jdbc_user => "xxx"
            jdbc_password => "xxx"
            jdbc_driver_library => "/usr/share/java/mysql.jar"
            jdbc_driver_class => "com.mysql.jdbc.Driver"
            statement_filepath => "./sql/req.sql"
            jdbc_paging_enabled => "true"
            jdbc_page_size => 10000
        }
    }
    output {
        elasticsearch {
            index => "xxx_resultats_preprod2"
            document_type => "resultats"
            hosts => ["localhost:9200"]
            codec => "plain"
            template => "./resultats_template.json"
            template_name => "xxx_resultats"
            template_overwrite => true
            document_id => "%{result_id}"
        }
    }
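
For context on my paging-plus-limit theory (see further down): if I understand the jdbc input correctly, with jdbc_paging_enabled it wraps the statement from statement_filepath in its own LIMIT/OFFSET subquery, roughly like this, so each page is a separate round trip through my complex subquery:

    -- approximate shape of what the plugin sends for each page
    SELECT * FROM (
        -- contents of ./sql/req.sql
        ...
    ) AS `t1` LIMIT 10000 OFFSET 0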

I've looked at some of the documentation here.

Running free -m on my Logstash/Elasticsearch server, I see this:

                 total       used       free     shared    buffers     cached
    Mem:          3951       2507       1444          0        148        724
    -/+ buffers/cache:       1634       2316
    Swap:         4093        173       3920

So total RAM = 4 GB, and 2.5 GB (63.4%) of it is used; after subtracting buffers/cache, only about 1.6 GB is actually taken. So RAM on the Elasticsearch server doesn't seem to be the issue.

Running free -m on my MySQL server, I see this:

                 total       used       free     shared    buffers     cached
    Mem:          3951       3836        115          0          2       1154
    -/+ buffers/cache:       2679       1271
    Swap:         4093        813       3280

So total RAM = 4 GB and ~3.8 GB (97%) is used, with 813 MB of swap in use as well. This looks like a problem.

My theories are that the MySQL server is occasionally swapping to disk, and that is part of why it's slow. Or maybe using BOTH paging and a LIMIT is slowing things down?
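
One way to check the swapping theory (a sketch using standard Linux tools): watch the si/so columns of vmstat on the MySQL server while the import runs; consistently nonzero values there mean pages are actively moving to and from swap.

    # report memory/swap activity every 5 seconds; nonzero si/so = active swapping
    vmstat 5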

The load average on the MySQL server is relatively low right now. From top:

    load average: 1.00, 1.00, 1.00

Under /data I see:

    sudo du -h -d 1
    13G    ./tmp_mysql
    4.5G   ./production

Using df -h I see:

    Filesystem      Size  Used Avail Use% Mounted on
    /dev/sdb1        32G  6.2G   24G  21% /data

If someone can help me make my queries execute much faster, I'd very much appreciate it!

Edit: Thank you all for your helpful feedback. It turns out my Logstash import had crashed (due to running out of /tmp space in MySQL for the subquery), and I had assumed I could just keep re-running the same import job. I could run it, and it loaded into Elasticsearch, but very, very slowly. When I re-implemented the loading of the index entirely and ran it against a new index, the load time became pretty much on par with what it was in the past. I estimate it will take 55 hours to load the data, which is a long time, but at least it's working reasonably now.

I did an EXPLAIN on my MySQL subquery and found some indexing issues I could address, too.
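
For anyone else debugging this, the check is just EXPLAIN in front of the query. A sketch, reusing the p_results table from the UPDATE above (the exact columns and index name here are illustrative, not my real schema):

    EXPLAIN SELECT result_id, computed_at
    FROM p_results
    WHERE computed_at IS NULL;

    -- "type: ALL" in the output means a full table scan; an index on the
    -- filtered column is the usual fix, e.g.:
    -- ALTER TABLE p_results ADD INDEX idx_computed_at (computed_at);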

Upvotes: 0

Views: 1683

Answers (1)

Thomas Decaux

Reputation: 22651

You indicate 2 potential problems here:

  • slow MySQL read
  • slow Elasticsearch write

You must eliminate one! Try outputting to stdout to see whether Elasticsearch is the bottleneck or not, as in the sketch below.
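
For example, a minimal test output (the dots codec prints one character per event, which makes read throughput easy to eyeball):

    output {
        stdout { codec => "dots" }
    }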

If yes, you can play with some ES settings to improve ingestion (a sketch follows the list):

  • refresh_interval => -1 (disable refresh)
  • remove replicas while doing the import (number_of_replicas: 0)
  • use more shards and more nodes

(more at https://www.elastic.co/guide/en/elasticsearch/reference/master/tune-for-indexing-speed.html)
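
For example, both index-level settings can be applied via the index settings API before the import and reverted afterwards (index name taken from the question; adjust host and port to your setup):

    curl -XPUT "localhost:9200/xxx_resultats_preprod2/_settings" \
      -H 'Content-Type: application/json' -d '
    {
        "index": {
            "refresh_interval": "-1",
            "number_of_replicas": 0
        }
    }'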

Upvotes: 1
