gourav vijayvargiya

Reputation: 21

How to capture data changes in AWS Glue?

Our source data lives in an on-premises SQL Server. We use AWS Glue to fetch the data from SQL Server and write it to S3. Could anyone explain how we can implement change data capture (CDC) in AWS Glue?

Note: We don't want to use AWS DMS.

Upvotes: 1

Views: 7009

Answers (3)

Yenteh Liu

Reputation: 21

You can use AWS DMS for CDC and then use the Apache Iceberg connector with the Glue Data Catalog to apply the changes as upserts: https://aws.amazon.com/blogs/big-data/implement-a-cdc-based-upsert-in-a-data-lake-using-apache-iceberg-and-aws-glue/

Upvotes: 1

Trust Okoroego

Reputation: 98

It's not possible to implement full change data capture through direct Glue data extraction. A job bookmark can help you identify inserts and updates if your table contains an updated_at timestamp column, but it won't cover deletes, because a deleted row simply stops appearing in the query results. You need an actual CDC solution for that.
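To illustrate why a timestamp-based approach misses deletes, here is a minimal sketch that uses an in-memory SQLite table as a stand-in for SQL Server. The table name `customers`, the column `updated_at`, and the helper `fetch_changes` are all hypothetical; this shows the general watermark idea, not how Glue bookmarks are implemented internally.

```python
import sqlite3


def fetch_changes(conn, last_watermark):
    """Return rows whose updated_at is newer than the previous watermark."""
    cur = conn.execute(
        "SELECT id, name, updated_at FROM customers "
        "WHERE updated_at > ? ORDER BY updated_at, id",
        (last_watermark,),
    )
    return cur.fetchall()


# Stand-in source table (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, updated_at TEXT)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "alice", "2024-01-01T00:00:00"), (2, "bob", "2024-01-01T00:00:00")],
)

# First run: an empty watermark means a full load.
first_run = fetch_changes(conn, "")
watermark = max(row[2] for row in first_run)

# The source changes between runs: one update, one insert, one delete.
conn.execute(
    "UPDATE customers SET name = 'alicia', updated_at = '2024-01-02T00:00:00' "
    "WHERE id = 1"
)
conn.execute("INSERT INTO customers VALUES (3, 'carol', '2024-01-02T00:00:00')")
conn.execute("DELETE FROM customers WHERE id = 2")

# Second run: the update and the insert are picked up, the delete is invisible.
second_run = fetch_changes(conn, watermark)
changed_ids = [row[0] for row in second_run]
```

After the second run, `changed_ids` contains ids 1 and 3 only. Nothing in the result tells the job that id 2 was removed, which is the gap a real CDC tool fills.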

While a direct AWS Glue connection to a database source is a great solution for bulk loads, I strongly discourage using it for incremental data extraction because of the cost. It's like using a truck to ship one bottle of water.

Like you, I am not a fan of AWS DMS either, but for a robust CDC solution a tool like Debezium could be a good fit. It integrates with Kafka and Kinesis, and you can sink the stream to S3 directly. Debezium can capture deletes and append a special `__deleted` flag column to your data, so your Glue ETL job can use this field to remove the deleted records.
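As a rough sketch of how a downstream job might consume such a delete flag, the plain-Python function below applies a batch of change records to a keyed snapshot. The field name `__deleted`, its string values "true"/"false", the `id` key, and the function `apply_changes` are assumptions made for illustration; check what your Debezium configuration actually emits. In a Glue job the same logic would typically be expressed as a merge or upsert on a DataFrame.

```python
def apply_changes(snapshot, changes, key="id", deleted_field="__deleted"):
    """Apply change records to a snapshot keyed by primary key.

    Records whose delete flag is true remove the row; all other records
    insert or overwrite it. The flag itself is not stored in the result.
    """
    result = dict(snapshot)
    for record in changes:
        pk = record[key]
        is_deleted = str(record.get(deleted_field, "false")).lower() == "true"
        if is_deleted:
            result.pop(pk, None)
        else:
            result[pk] = {
                field: value
                for field, value in record.items()
                if field != deleted_field
            }
    return result


# Current state of the target table (hypothetical data).
snapshot = {
    1: {"id": 1, "name": "alice"},
    2: {"id": 2, "name": "bob"},
}

# A batch of change records, in the order they were captured.
changes = [
    {"id": 1, "name": "alicia", "__deleted": "false"},  # update
    {"id": 3, "name": "carol", "__deleted": "false"},   # insert
    {"id": 2, "name": "bob", "__deleted": "true"},      # delete
]

merged = apply_changes(snapshot, changes)
```

Records must be applied in capture order, otherwise a delete followed by a re-insert of the same key would end up in the wrong state.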

Upvotes: 0

Bob Haffner

Reputation: 8493

I'm only aware of Glue job bookmarks. They will help you with new records (inserts), but not with the updates and deletes that you typically get from a true CDC solution.

I'm not sure of your use case, but you could check out the following project. It has a fairly efficient diff feature that compares two datasets and, with the right options, can give you CDC-like output:

https://github.com/G-Research/spark-extension/blob/master/DIFF.md
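To show the idea behind a snapshot diff without depending on Spark, here is a minimal plain-Python sketch that compares two snapshots keyed by primary key and labels each difference. The function `diff_snapshots` and the labels "I", "C", and "D" (insert, change, delete) are chosen for illustration; this is not the spark-extension API, so see the linked DIFF.md for the real usage.

```python
def diff_snapshots(old, new):
    """Compare two snapshots keyed by primary key.

    Returns a list of (action, key, old_row, new_row) tuples where action is
    "I" for inserted, "C" for changed, and "D" for deleted rows. Unchanged
    rows are omitted.
    """
    changes = []
    for pk in sorted(old.keys() | new.keys()):
        if pk not in old:
            changes.append(("I", pk, None, new[pk]))
        elif pk not in new:
            changes.append(("D", pk, old[pk], None))
        elif old[pk] != new[pk]:
            changes.append(("C", pk, old[pk], new[pk]))
    return changes


# Yesterday's and today's full extracts (hypothetical data).
yesterday = {
    1: {"name": "alice"},
    2: {"name": "bob"},
    4: {"name": "dave"},
}
today = {
    1: {"name": "alicia"},
    3: {"name": "carol"},
    4: {"name": "dave"},
}

changes = diff_snapshots(yesterday, today)
actions = [(action, pk) for action, pk, _, _ in changes]
```

The trade-off is that a diff needs a full extract of the source on every run, so it catches deletes but does not reduce the load on SQL Server the way log-based CDC does.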

Upvotes: 0
