Reputation: 1352
Given the following PostgreSQL table test (some columns omitted, e.g. data, which is used in the pipeline):

id (uuid)                            | updated_at (timestamp with time zone)
-------------------------------------+---------------------------------------
652d88d3-e978-48b1-bd0f-b8188054a920 | 2018-08-08 11:02:00.000000
50cf7942-cd18-4730-a65e-fc06f11cfd1d | 2018-08-07 15:30:00.000000
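For reproduction, this is roughly the schema and data (the type of data isn't given above, so text is just a placeholder):

CREATE TABLE test (
    id uuid PRIMARY KEY,
    data text,  -- actual type omitted above; placeholder only
    updated_at timestamp with time zone
);

INSERT INTO test (id, updated_at) VALUES
    ('652d88d3-e978-48b1-bd0f-b8188054a920', '2018-08-08 11:02:00+00'),
    ('50cf7942-cd18-4730-a65e-fc06f11cfd1d', '2018-08-07 15:30:00+00');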
Given Logstash 6.3.2 (via Docker) with the following pipeline (jdbc_*
omitted):
input {
    jdbc {
        statement => "SELECT id, data, updated_at FROM test WHERE updated_at > :sql_last_value"
        schedule => "* * * * *"
        use_column_value => true
        tracking_column => "updated_at"
        tracking_column_type => "timestamp"
    }
}

filter {
    mutate { remove_field => "updated_at" }
}

output {
    elasticsearch {
        hosts => "elasticsearch:9200"
        index => "test"
        document_id => "%{id}"
    }
}
When this pipeline runs the very first time (or with clean_run => true) I'd expect it to process both database rows (because sql_last_value is 1970-01-01 00:00:00.000000) and to set the tracking column value stored in .logstash_jdbc_last_run to 2018-08-08 11:02:00.000000000 Z (i.e. the latest of all found updated_at timestamps). Instead it is set to 2018-08-07 15:30:00.000000000 Z, the earlier of the two timestamps. This means that in the 2nd run the row with the later timestamp is processed again, even though it hasn't changed.
Is this the expected behaviour? Am I missing some other configuration option that controls this aspect?
It seems that the updated_at of the very last row returned is used (just tried it with more rows). So I'd have to add an ORDER BY updated_at ASC, which I believe isn't great in terms of DB query performance; see the sketch below.
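If ordering is indeed required, a btree index on updated_at should keep it cheap, since such an index can serve both the WHERE filter and the ORDER BY without a separate sort step. A sketch of what I have in mind (the index name is made up):

CREATE INDEX test_updated_at_idx ON test (updated_at);

-- statement for the jdbc input, now with ordering
SELECT id, data, updated_at
FROM test
WHERE updated_at > :sql_last_value
ORDER BY updated_at ASC;

For reference, here's what happens across three scheduled runs without the ordering (contents of .logstash_jdbc_last_run and the queries issued):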
sh-4.2$ cat .logstash_jdbc_last_run
cat: .logstash_jdbc_last_run: No such file or directory
[2018-08-09T14:38:01,540][INFO ][logstash.inputs.jdbc ] (0.001254s) SELECT id, data, updated_at FROM test WHERE updated_at > '1970-01-01 00:00:00.000000+0000'
sh-4.2$ cat .logstash_jdbc_last_run
--- 2018-08-07 15:30:00.000000000 Z
[2018-08-09T14:39:00,335][INFO ][logstash.inputs.jdbc ] (0.001143s) SELECT id, data, updated_at FROM test WHERE updated_at > '2018-08-07 15:30:00.000000+0000'
sh-4.2$ cat .logstash_jdbc_last_run
--- 2018-08-08 11:02:00.000000000 Z
[2018-08-09T14:40:00,104][INFO ][logstash.inputs.jdbc ] (0.000734s) SELECT id, data, updated_at FROM test WHERE updated_at > '2018-08-08 11:02:00.000000+0000'
sh-4.2$ cat .logstash_jdbc_last_run
--- 2018-08-08 11:02:00.000000000 Z
Upvotes: 2
Views: 6368
Reputation: 2989
I experienced the same problem last month, going from MySQL to ES, but in the end it was solved. The file .logstash_jdbc_last_run is created in your home directory by default; you can change its path with the last_run_metadata_path config option. I am using the UTC date format.
The first time, sql_last_value is 1970-01-01 00:00:00.000000. Logstash then writes the date of the first record returned by MySQL into the logstash_jdbc_last_run file; that's why I use ORDER BY updated_at DESC. The following code worked for me.
input {
    jdbc {
        jdbc_default_timezone => "UTC"
        statement => "SELECT id, data, DATE_FORMAT(updated_at, '%Y-%m-%d %T') AS updated_at FROM test WHERE updated_at > :sql_last_value ORDER BY updated_at DESC"
        schedule => "* * * * * *"
        use_column_value => true
        tracking_column => "updated_at"
        tracking_column_type => "timestamp"
        last_run_metadata_path => "/home/logstash_track_date/.logstash_user_jdbc_last_run"
    }
}

filter {
    mutate { remove_field => "updated_at" }
}

output {
    stdout { codec => rubydebug }
    elasticsearch {
        hosts => "elasticsearch:9200"
        index => "test"
        document_id => "%{id}"
    }
}
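To check it behaves as intended, run the pipeline once and inspect the metadata file at the configured path, the same way as in the question:

sh-4.2$ cat /home/logstash_track_date/.logstash_user_jdbc_last_run

It should now contain the tracked updated_at value that the next run will use as :sql_last_value.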
Upvotes: 2