Ansonmus

Reputation: 335

Ingest data from SQL Server using Logstash - batches or not, and is an "ORDER BY" required in the (SQL) statement?

We load product data from SQL Server into Elasticsearch using Logstash. Our products have a ChangeId, a number that is sequentially increased whenever the product data is updated.

In our development process we sometimes 'change' all the records to reindex all data. I'm curious what the best way is to form the (SQL) statement. The underlying question is how much is handled by Logstash and how much we have to handle ourselves.

To give some context, here is a small, simplified part of the pipeline:

use_column_value => true
tracking_column => "ChangeId"
statement => "
    SELECT * FROM [Uni].[TPProducts] 
    WHERE [ChangeId] > :sql_last_value"
lowercase_column_names => false
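
For reference, here is a minimal sketch of the full input block; the connection string, credentials, driver class, metadata path, and fetch size are placeholders/assumptions, not our real values:

input {
  jdbc {
    # Placeholders: adjust to the real server, database, and credentials
    jdbc_connection_string => "jdbc:sqlserver://localhost:1433;databaseName=Uni"
    jdbc_user => "logstash"
    jdbc_password => "secret"
    jdbc_driver_class => "com.microsoft.sqlserver.jdbc.SQLServerDriver"

    # Incremental tracking on the sequentially increasing ChangeId
    use_column_value => true
    tracking_column => "ChangeId"
    tracking_column_type => "numeric"
    last_run_metadata_path => "/usr/share/logstash/.logstash_jdbc_last_run"

    # Hint to the JDBC driver how many rows to fetch per round trip
    jdbc_fetch_size => 1000

    statement => "
        SELECT * FROM [Uni].[TPProducts]
        WHERE [ChangeId] > :sql_last_value"
    lowercase_column_names => false
  }
}

My assumption is that jdbc_fetch_size is passed through to the JDBC driver as the fetch size, i.e. how many rows are pulled per round trip, which might explain the streaming-like behaviour described in question 1 below.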

When all of, say, 20,000,000 records are changed, this query becomes heavy because of the JSON generation in the [Uni].[TPProducts] view.

I have these (somewhat open) questions:

  1. While the pipeline runs, we see the index growing while the query is still active as a process/session on SQL Server. To me this implies the results are streamed in some way (which is what we want). Maybe this is related to the JDBC driver's fetch size (see jdbc_fetch_size in the sketch above), but I don't know how to verify that.

  2. The sql_last_value on disk mostly does not change while the pipeline is running, but once we believe we saw it set to 1,000,000 mid-run. When we 'kill' the pipeline, the ingested data remains, but sql_last_value is not updated.

  3. Do we need an ORDER BY in this statement (ORDER BY [ChangeId]), or is that managed by Logstash?

  4. Is working with batches (and consequently including the ORDER BY clause) a better approach for large sets? See the paging sketch after this list for what I have in mind.
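
To illustrate question 4, this is the batched variant I'm considering. It assumes a plugin version recent enough to support jdbc_paging_mode => "explicit", where (as I understand it) Logstash substitutes :offset and :size per page and the statement does its own paging; the page size is arbitrary. SQL Server requires an ORDER BY for OFFSET/FETCH, which is part of why I'm asking about ordering:

input {
  jdbc {
    # Connection and tracking settings as in the snippet above (omitted)

    # Explicit paging: the statement handles paging itself via the
    # :offset and :size parameters that Logstash fills in per page
    jdbc_paging_enabled => true
    jdbc_paging_mode => "explicit"
    jdbc_page_size => 100000

    # SQL Server requires an ORDER BY for OFFSET/FETCH, and a deterministic
    # order keeps pages from overlapping or skipping rows
    statement => "
        SELECT * FROM [Uni].[TPProducts]
        WHERE [ChangeId] > :sql_last_value
        ORDER BY [ChangeId]
        OFFSET :offset ROWS FETCH NEXT :size ROWS ONLY"
  }
}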

Upvotes: 0

Views: 28

Answers (0)
