shan

Reputation: 288

Logstash JDBC plugin: Importing millions of records from Oracle to Elasticsearch

Scenario: I am trying to import a large data set (a few million records) from an Oracle DB into Elasticsearch. I am able to import the data using the Logstash JDBC input plugin.

Issue: It is taking a very long time (5 hours) to complete.

Upvotes: 5

Views: 7453

Answers (3)

Erangad

Reputation: 861

You can use the :sql_last_value option to pull data from the database incrementally with a scheduler. Something like the following will be helpful (assuming you have an ID column):

input {
   jdbc {
      jdbc_connection_string => "jdbc:oracle:thin:@IP:PORT:SID"
      jdbc_user => "user"
      jdbc_password => "password"
      jdbc_driver_library => ".......\ojdbc8.jar"
      jdbc_driver_class => "Java::oracle.jdbc.OracleDriver"
      # fetch only rows newer than the last tracked id, sorted before limiting
      statement => "select * from (select * from TABLE where id > :sql_last_value ORDER BY id ASC) where rownum < 20"
      use_column_value => true
      tracking_column => "id"
      tracking_column_type => "numeric"
      # run every minute
      schedule => "* * * * *"
   }
}

Update: Refactored the SQL to use rownum and ORDER BY so the results are limited in the intended order (sort before limiting). See: On ROWNUM and Limiting Results

The same result can be achieved using pagination, but it has performance issues. For example, if we use pagination:

input {
   jdbc {
      jdbc_connection_string => "jdbc:oracle:thin:@IP:PORT:SID"
      jdbc_user => "user"
      jdbc_password => "password"
      jdbc_driver_library => ".......\ojdbc8.jar"
      jdbc_driver_class => "Java::oracle.jdbc.OracleDriver"
      jdbc_page_size => 20
      jdbc_paging_enabled => true
      statement => "select * from TABLE"
   }
}

But this will wrap "select * from TABLE" like this:

SELECT * FROM (SELECT "T1".*, ROWNUM "X_SEQUEL_ROW_NUMBER_X" FROM (SELECT * FROM (select * from TABLE) "T1") "T1") "T1" WHERE (("X_SEQUEL_ROW_NUMBER_X" > 0) AND ("X_SEQUEL_ROW_NUMBER_X" <= (0 + 20)));

and will run without a scheduler, dividing the query into pages based on the specified jdbc_page_size (20 in this example). But this method has obvious performance issues: Oracle has to re-run the wrapped query for every page, so later pages get progressively slower.

Upvotes: 2

Basem Mohammed

Reputation: 3

You can put more than one jdbc input plugin in the same config file; it works fine for me.

For example:

input {
   jdbc {
      ...
   }
   jdbc {
      ...
   }
}
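To make that concrete, here is a minimal sketch, assuming a numeric id column; the table name, ranges, and connection details are illustrative. Each jdbc input handles a non-overlapping slice of the table, and Logstash runs each input in its own thread, so both slices are read in parallel within one pipeline:

input {
   jdbc {
      jdbc_connection_string => "jdbc:oracle:thin:@IP:PORT:SID"
      jdbc_user => "user"
      jdbc_password => "password"
      jdbc_driver_library => "path/to/ojdbc8.jar"
      jdbc_driver_class => "Java::oracle.jdbc.OracleDriver"
      # first slice of the table
      statement => "select * from TABLE where id between 0 and 500000"
   }
   jdbc {
      jdbc_connection_string => "jdbc:oracle:thin:@IP:PORT:SID"
      jdbc_user => "user"
      jdbc_password => "password"
      jdbc_driver_library => "path/to/ojdbc8.jar"
      jdbc_driver_class => "Java::oracle.jdbc.OracleDriver"
      # second slice; the ranges must not overlap or rows are indexed twice
      statement => "select * from TABLE where id > 500000"
   }
}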

Upvotes: -1

xeye

Reputation: 1256

You can play with jdbc_fetch_size first. The next step would be running the import on several nodes; for each node, split the SQL query into non-overlapping sets, e.g. select * from XX where id between 0 and 1000000.
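As a rough sketch of the fetch-size tuning (the value 10000, the table name XX, and the ID range are illustrative assumptions to be tuned per node):

input {
   jdbc {
      jdbc_connection_string => "jdbc:oracle:thin:@IP:PORT:SID"
      jdbc_user => "user"
      jdbc_password => "password"
      jdbc_driver_library => "path/to/ojdbc8.jar"
      jdbc_driver_class => "Java::oracle.jdbc.OracleDriver"
      # rows fetched from Oracle per round trip (the driver default is small)
      jdbc_fetch_size => 10000
      # this node's slice; other nodes get other non-overlapping ranges
      statement => "select * from XX where id between 0 and 1000000"
   }
}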

You can also set the number of ES index replicas to 0 before indexing the big bulk and change it back afterwards. And, of course, use the bulk indexing API with a good batch size and parallelism.
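On the Logstash side, the elasticsearch output already sends documents through the bulk API; batch size and parallelism come from the pipeline settings rather than the output itself. A minimal sketch, with an illustrative host and index name (the replica count itself is changed through Elasticsearch's index settings API, via index.number_of_replicas, not through Logstash):

# Bulk batch size and parallelism are pipeline settings
# (pipeline.batch.size and pipeline.workers in logstash.yml,
# or the -b and -w command-line flags).
output {
   elasticsearch {
      hosts => ["http://localhost:9200"]   # illustrative host
      index => "my_index"                  # illustrative index name
   }
}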

Upvotes: 4
