Reputation: 31
I'm currently working on a project which has a Kafka instance plugged to a Postgres DB throught a Debezium source connector. The connector streams every update of every row on some selected tables. The message sent by Debezium to each topic contains the new values of each column of each row updated. The old values before the update are not included in the message, only the new ones.
Recently, we added another connector to this Kafka instance. It's an S3 sink connector that periodically flushes these topic messages to a bucket in JSON format.
Now, we want to use AWS to Athena to query all this data on S3. The problem is that a single row in the database can have multiple row versions in multiple S3 files, because each update generates a new version of that row.
What we want to do is query these data on S3 like they were stored in the original database. For an example, I want my Athena queries not be aware that there are multiple versions of the same row. Since updates on the database can occur at any time, I cant go back to the S3 file where the first version of the row was written and update it. I need some data transformation that makes transparent to the user querying S3 that there is a single version of that row, with the values of its latest update, just like in the database.
Is there anything we can use to achieve that? I read that AWS Glue may be of some help, but I was not able to find much content regarding this specific case. We have also tried to use SQL to always search for the latest version of each row, but that seems overly complicated for the large amount of queries we currently have.
Upvotes: 0
Views: 74