Scooter
Scooter

Reputation: 1079

Architecture for tracking data changes in application DB required for warehousing

Overview

I have an OLTP DB that stores application transaction data and acts as the source of truth for the current state of the application. I would like to have my DWH store historical data so I can do analyses that compare previous states of the application to the current state. In Kimball terminology, I would like Type 2 dimensions for my SCDs.

In the application DB, changes to the dimensions are not tracked but rather updated in place for efficiency reasons. So, if I were to ETL directly from my application DB once per day, I would be losing historical data potentially. My question is how can I track these data changes for the sake of warehousing? Here are some options I have considered:

Option #1: Ingest Event Data Into Data Lake

Whenever anything happens in the application an event is emitted. These events can capture all the information I need for the warehouse. My thoughts are you could emit the events using something like Apache Kafka, and have some process listen for the events and store them in a raw form in a data lake that's completely immutable. Then, you would use an ETL process that works from the data lake instead of the application DB to load up the warehouse.

Pros

Cons

Option #2: Batch Process Application DB Snapshots

In this scenario, I would use the daily snapshots of the DB as a source for ETL'ing into the DWH. Historical data would be at the grain of how often the snapshot takes place (daily in this case). This would mean that an change data that happens within the day would be lost. However, it may not be that important to store such fine-grained data anyways.

Pros

Cons

Option #3: Batch Process Log Data

Instead of emitting an event to something like Apache Kafka, store the event data in a temporary log file on the application server. Then, the ETL process would involve scanning all the application servers to grab and process the log files. These log files would store extensive transaction history, enough to get all the data required for the DWH.

Pros

Cons

Upvotes: 0

Views: 316

Answers (0)

Related Questions