Reputation: 288
Scenario:
- I am trying to import a large data set (a few million records) from an Oracle DB into Elasticsearch.
- I am able to import the data using the Logstash JDBC input plugin.
Issue: It's taking a very long time (5 hours) to complete.
Upvotes: 5
Views: 7453
Reputation: 861
You can use the :sql_last_value option to fetch data from the database incrementally with a scheduler. Something like the following will be helpful (assuming you have an ID field):
input {
  jdbc {
    jdbc_connection_string => "jdbc:oracle:thin:IP:PORT:SID"
    jdbc_user => "user"
    jdbc_password => "password"
    jdbc_driver_library => ".......\ojdbc8.jar"
    jdbc_driver_class => "Java::oracle.jdbc.OracleDriver"
    statement => "select * from (select * from TABLE where id >= :sql_last_value ORDER BY id ASC) where rownum < 20"
    use_column_value => true
    tracking_column => "id"
    tracking_column_type => "numeric"
    schedule => "* * * * *"
  }
}
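As a side note, the plugin persists :sql_last_value between scheduled runs; if you want that tracking state in a known location, the last_run_metadata_path option can be set explicitly. A minimal sketch (the path is just an illustration):
input {
  jdbc {
    # ... same connection and statement settings as above ...
    # file where the last tracked id is stored between scheduled runs
    last_run_metadata_path => "/usr/share/logstash/.logstash_jdbc_last_run"
  }
}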
Update: refactored the SQL to use rownum and ORDER BY so the results are limited in the intended order (sort before limiting). See: On ROWNUM and Limiting Results.
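In other words, the statement sorts inside the subquery first and only then limits with ROWNUM, so the 20 rows come from the ordered set (TABLE and id stand in for your own table and tracking column):
select *
from (select * from TABLE where id >= :sql_last_value order by id asc)
where rownum < 20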
The same result can be achieved using pagination, but it has performance issues. For example, with pagination enabled:
input {
  jdbc {
    jdbc_connection_string => "jdbc:oracle:thin:IP:PORT:SID"
    jdbc_user => "user"
    jdbc_password => "password"
    jdbc_driver_library => ".......\ojdbc8.jar"
    jdbc_driver_class => "Java::oracle.jdbc.OracleDriver"
    jdbc_page_size => 20
    jdbc_paging_enabled => true
    statement => "select * from TABLE"
  }
}
But this will wrap "select * from TABLE" like:
SELECT * FROM (SELECT "T1".*, ROWNUM "X_SEQUEL_ROW_NUMBER_X" FROM (SELECT * FROM (select * from TABLE) "T1") "T1") "T1" WHERE (("X_SEQUEL_ROW_NUMBER_X" > 0) AND ("X_SEQUEL_ROW_NUMBER_X" <= (0 + 20)));
and it will run without a scheduler, dividing the query into pages based on the specified jdbc_page_size (20 in this example). But this method obviously has performance issues.
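To see why, note that each page re-runs the whole inner query with a larger offset; assuming the same wrapping, a later page (the third one here) would look roughly like
SELECT * FROM (SELECT "T1".*, ROWNUM "X_SEQUEL_ROW_NUMBER_X" FROM (SELECT * FROM (select * from TABLE) "T1") "T1") "T1" WHERE (("X_SEQUEL_ROW_NUMBER_X" > 40) AND ("X_SEQUEL_ROW_NUMBER_X" <= (40 + 20)));
so Oracle has to number and discard all earlier rows again for every page, which gets slower as the offset grows.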
Upvotes: 2
Reputation: 3
You can put more than one jdbc input plugin in the same config file. It works fine for me.
For example:
input {
  jdbc {
    ...
  }
  jdbc {
    ...
  }
}
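A minimal sketch of what that can look like, with each input covering a non-overlapping ID range so they load in parallel (the table name, column, and ranges are assumptions for illustration):
input {
  jdbc {
    jdbc_connection_string => "jdbc:oracle:thin:IP:PORT:SID"
    jdbc_user => "user"
    jdbc_password => "password"
    jdbc_driver_library => ".......\ojdbc8.jar"
    jdbc_driver_class => "Java::oracle.jdbc.OracleDriver"
    # first half of the table (assumed ID range)
    statement => "select * from TABLE where id between 0 and 1000000"
  }
  jdbc {
    jdbc_connection_string => "jdbc:oracle:thin:IP:PORT:SID"
    jdbc_user => "user"
    jdbc_password => "password"
    jdbc_driver_library => ".......\ojdbc8.jar"
    jdbc_driver_class => "Java::oracle.jdbc.OracleDriver"
    # second half of the table (assumed ID range)
    statement => "select * from TABLE where id between 1000001 and 2000000"
  }
}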
Upvotes: -1
Reputation: 1256
You can play with jdbc_fetch_size first. The next step would be running the import on several nodes; for each node, split the SQL query into non-overlapping sets, like select * from XX where id between 0 and 1000000.
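A minimal sketch of one node's input, assuming an illustrative jdbc_fetch_size of 10000 and an ID range picked for that node:
input {
  jdbc {
    jdbc_connection_string => "jdbc:oracle:thin:IP:PORT:SID"
    jdbc_user => "user"
    jdbc_password => "password"
    jdbc_driver_library => ".......\ojdbc8.jar"
    jdbc_driver_class => "Java::oracle.jdbc.OracleDriver"
    # larger fetch size means fewer round trips to Oracle per batch of rows
    jdbc_fetch_size => 10000
    # this node only handles its own, non-overlapping ID range
    statement => "select * from XX where id between 0 and 1000000"
  }
}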
You can also set the number of ES index replicas to 0 before indexing a big bulk and change it back afterwards. And of course, use the bulk indexing API with a good batch size and parallelism.
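A minimal sketch of dropping replicas around the bulk load with the index settings API (the index name my_index and the host are assumptions; restore your original replica count once the import finishes):
# before the bulk import: stop writing replica copies
curl -X PUT "localhost:9200/my_index/_settings" -H 'Content-Type: application/json' -d '{"index": {"number_of_replicas": 0}}'
# after the bulk import: restore replication (1 is just an example)
curl -X PUT "localhost:9200/my_index/_settings" -H 'Content-Type: application/json' -d '{"index": {"number_of_replicas": 1}}'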
Upvotes: 4