Markus Ratzer

Reputation: 1352

Logstash JDBC tracking column value not latest timestamp

Database

Given the following PostgreSQL table test (some columns omitted, e.g. data, which is used in the pipeline):

id (uuid)                            | updated_at (timestamp with time zone)
652d88d3-e978-48b1-bd0f-b8188054a920 | 2018-08-08 11:02:00.000000
50cf7942-cd18-4730-a65e-fc06f11cfd1d | 2018-08-07 15:30:00.000000

Logstash

Given Logstash 6.3.2 (via Docker) with the following pipeline (jdbc_* omitted):

input {
    jdbc {
        statement => "SELECT id, data, updated_at FROM test WHERE updated_at > :sql_last_value"
        schedule => "* * * * *"
        use_column_value => true
        tracking_column => "updated_at"
        tracking_column_type => "timestamp"
    }
}

filter {
    mutate { remove_field => "updated_at" }
}

output {
    elasticsearch {
        hosts => "elasticsearch:9200"
        index => "test"
        document_id => "%{id}"
    }
}

Problem

When this pipeline runs the very first time (or with clean_run => true) I'd expect it to process both database rows (because sql_last_value is 1970-01-01 00:00:00.000000) and set the value of the tracking column stored in .logstash_jdbc_last_run to 2018-08-08 11:02:00.000000000 Z (= the latest of all found updated_at timestamps). It'll be set to 2018-08-07 15:30:00.000000000 Z though, which is the earlier of the two given timestamps. This means that in the 2nd run the other of the two rows will be processed again, even if it hasn't changed.

Is this the expected behaviour? Am I missing some other configuration option that controls this aspect?

Edit

It seems that the updated_at of the very last row returned is used (just tried it with more rows). So I'd have to add an ORDER BY updated_at ASC, which I believe isn't great for DB query performance.
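For reference, a rough sketch of what that would look like with the input from above (untested, same options as before):

input {
    jdbc {
        # Ordering ascending means the last row returned carries the latest
        # updated_at, which is the value that ends up in .logstash_jdbc_last_run
        statement => "SELECT id, data, updated_at FROM test WHERE updated_at > :sql_last_value ORDER BY updated_at ASC"
        schedule => "* * * * *"
        use_column_value => true
        tracking_column => "updated_at"
        tracking_column_type => "timestamp"
    }
}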

Logs, etc.

sh-4.2$ cat .logstash_jdbc_last_run
cat: .logstash_jdbc_last_run: No such file or directory

[2018-08-09T14:38:01,540][INFO ][logstash.inputs.jdbc     ] (0.001254s) SELECT id, data, updated_at FROM test WHERE updated_at > '1970-01-01 00:00:00.000000+0000'
sh-4.2$ cat .logstash_jdbc_last_run
--- 2018-08-07 15:30:00.000000000 Z


[2018-08-09T14:39:00,335][INFO ][logstash.inputs.jdbc     ] (0.001143s) SELECT id, data, updated_at FROM test WHERE updated_at > '2018-08-07 15:30:00.000000+0000'
sh-4.2$ cat .logstash_jdbc_last_run
--- 2018-08-08 11:02:00.000000000 Z


[2018-08-09T14:40:00,104][INFO ][logstash.inputs.jdbc     ] (0.000734s) SELECT id, data, updated_at FROM test WHERE updated_at > '2018-08-08 11:02:00.000000+0000'
sh-4.2$ cat .logstash_jdbc_last_run
--- 2018-08-08 11:02:00.000000000 Z

Upvotes: 2

Views: 6368

Answers (1)

Saif

Reputation: 2989

I have been experiencing the same problem since last month, syncing from MySQL to ES, but in the end it was solved. The file .logstash_jdbc_last_run is created in your home directory by default. You can change the path of this file by setting the last_run_metadata_path config option. I am using the UTC date format.

The first time, sql_last_value is 1970-01-01 00:00:00.000000. The plugin then writes the date from the first record returned by MySQL into the .logstash_jdbc_last_run file. That's why I use ORDER BY updated_at DESC. The following configuration worked for me.

input {
    jdbc {
         jdbc_default_timezone => "UTC"
         statement => "SELECT id, data, DATE_FORMAT(updated_at, '%Y-%m-%d %T') AS updated_at FROM test WHERE updated_at > :sql_last_value ORDER BY updated_at DESC"
         schedule => "* * * * * *"
         use_column_value => true
         tracking_column => "updated_at"
         tracking_column_type => "timestamp"
         last_run_metadata_path => "/home/logstash_track_date/.logstash_user_jdbc_last_run"
    }
}

filter {
    mutate { remove_field => "updated_at" }
}

output {
    stdout { codec => rubydebug }
    elasticsearch {
        hosts => "elasticsearch:9200"
        index => "test"
        document_id => "%{id}"
    }
}
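
With this in place, the metadata file at the configured path should contain the latest timestamp after each run, i.e. something like:

sh-4.2$ cat /home/logstash_track_date/.logstash_user_jdbc_last_run
--- 2018-08-08 11:02:00.000000000 Z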

Upvotes: 2
