Architecture for tracking application DB data changes for warehousing
Overview
I have an OLTP DB that stores application transaction data and acts as the source of truth for the current state of the application. I would like my DWH to store historical data so I can run analyses that compare previous states of the application to the current state. In Kimball terminology, I want my slowly changing dimensions handled as Type 2.
In the application DB, changes to dimension data are not tracked but rather applied in place for efficiency reasons. So, if I were to ETL directly from the application DB once per day, I would potentially lose historical data. My question is: how can I track these data changes for warehousing? Here are the options I have considered.
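For concreteness, here is a minimal sketch of the Type 2 output I'm aiming for; CustomerDimRow, the segment attribute, and the column names are just illustrative, not my actual schema:

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import Optional

@dataclass
class CustomerDimRow:
    # Illustrative Type 2 dimension row: one row per version of a customer.
    customer_id: int          # business key from the application DB
    segment: str              # example tracked attribute
    valid_from: date
    valid_to: Optional[date]  # None means "current version"
    is_current: bool

def apply_type2_change(current_row: CustomerDimRow, new_segment: str,
                       change_date: date) -> tuple[CustomerDimRow, CustomerDimRow]:
    """Close the current version and open a new one for the changed attribute."""
    closed = replace(current_row, valid_to=change_date, is_current=False)
    opened = CustomerDimRow(
        customer_id=current_row.customer_id,
        segment=new_segment,
        valid_from=change_date,
        valid_to=None,
        is_current=True,
    )
    return closed, opened

# Example: a customer moves from "basic" to "premium" on 2023-06-01.
old = CustomerDimRow(42, "basic", date(2023, 1, 1), None, True)
closed, opened = apply_type2_change(old, "premium", date(2023, 6, 1))
```

Whichever option I choose ultimately just needs to detect the changes that feed something like apply_type2_change.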
Option #1: Ingest Event Data Into Data Lake
Whenever anything happens in the application, an event is emitted. These events capture all the information I need for the warehouse. My thinking is to emit the events through something like Apache Kafka, have a process consume them, and store them in raw form in a completely immutable data lake. The ETL process would then load the warehouse from the data lake instead of the application DB.
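As a rough sketch of what I mean (the topic name, broker address, and bucket are placeholders, and batching, partitioning, and error handling are omitted), the producer side and a minimal lake-landing consumer could look like this with kafka-python and boto3:

```python
import json
from datetime import datetime, timezone

import boto3                                              # S3 client for the data lake
from kafka import KafkaConsumer, KafkaProducer           # kafka-python client

# Application side: emit a change event alongside the in-place update.
producer = KafkaProducer(
    bootstrap_servers="kafka:9092",                       # hypothetical broker
    value_serializer=lambda v: json.dumps(v).encode(),
)
producer.send("app.customer.changed", {                   # hypothetical topic
    "customer_id": 42,
    "segment": "premium",
    "changed_at": datetime.now(timezone.utc).isoformat(),
})
producer.flush()

# Lake side: a small consumer that lands each raw event, unmodified, in S3.
consumer = KafkaConsumer(
    "app.customer.changed",
    bootstrap_servers="kafka:9092",
    value_deserializer=lambda b: json.loads(b),
)
s3 = boto3.client("s3")
for msg in consumer:
    key = f"raw/customer_changed/{msg.timestamp}-{msg.partition}-{msg.offset}.json"
    s3.put_object(Bucket="my-data-lake",                  # hypothetical bucket
                  Key=key,
                  Body=json.dumps(msg.value).encode())
```

Landing the raw payload untouched is what keeps the lake immutable and replayable for later ETL or ML use.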
Pros
- Can achieve real-time data analysis if required in the future (currently not necessary)
- An immutable data lake can act as a foundation for other types of analytics such as ML or other warehouses
- The data lake serves as a single source of truth for all data, which will be useful later when there are multiple application DBs and other ingestion sources
Cons
- Requires an event processing/streaming service, which is more infrastructure to maintain
- Data can be lost or duplicated, causing the lake to drift out of sync with the application DB
- Requires storing data in two places, which adds development overhead
Option #2: Batch Process Application DB Snapshots
In this scenario, I would use the daily snapshots of the DB as the source for ETL'ing into the DWH. Historical data would be at the grain of how often the snapshots are taken (daily in this case), which means any changes that happen within a single day would be lost. However, it may not be that important to store such fine-grained history anyway.
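The delta calculation mentioned in the cons below could be a keyed comparison between yesterday's and today's snapshot extracts. This pandas sketch assumes a customer_id key and illustrative tracked columns (NaN handling and surrogate keys omitted for brevity):

```python
import pandas as pd

# Hypothetical tracked attributes of a dimension extracted from each daily snapshot.
TRACKED_COLS = ["segment", "country"]

def snapshot_delta(previous: pd.DataFrame, current: pd.DataFrame) -> pd.DataFrame:
    """Return rows that are new or whose tracked attributes changed since the last snapshot."""
    merged = current.merge(
        previous, on="customer_id", how="left",
        suffixes=("", "_prev"), indicator=True,
    )
    is_new = merged["_merge"] == "left_only"
    changed = pd.Series(False, index=merged.index)
    for col in TRACKED_COLS:
        changed |= merged[col].ne(merged[f"{col}_prev"]) & ~is_new
    return merged.loc[is_new | changed, ["customer_id"] + TRACKED_COLS]
```

The rows it returns would then feed the Type 2 close-and-insert logic sketched in the overview.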
Pros
- Data storage is not duplicated between two places
- No extra infrastructure is required as daily snapshots are already automatically obtained and stored in S3
- Data integrity is maintained because we're working directly with the current application state, so it's easier to stay in sync
Cons
- Requires a delta calculation against the previous snapshot to determine what new dimension objects need to be imported (this may actually be required in any scenario but seems more natural to do with the event architecture)
- The grain of the historical data is coupled to the frequency at which snapshots occur
- Only compatible with ETL'ing into DWH and would not be as useful for ML/data science applications that work well with raw data
Option #3: Batch Process Log Data
Instead of emitting an event to something like Apache Kafka, store the event data in a temporary log file on the application server. Then, the ETL process would involve scanning all the application servers to grab and process the log files. These log files would store extensive transaction history, enough to get all the data required for the DWH.
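As a sketch of what the log format and the ETL-side parsing could look like (the file path and field names are placeholders), I'd write one JSON object per change and parse the files line by line:

```python
import json
from datetime import datetime, timezone

LOG_PATH = "/var/log/app/dwh_changes.jsonl"   # hypothetical location on each app server

# Application side: append one JSON object per dimension change.
def log_change(entity: str, entity_id: int, attributes: dict) -> None:
    record = {
        "entity": entity,
        "entity_id": entity_id,
        "attributes": attributes,
        "changed_at": datetime.now(timezone.utc).isoformat(),
    }
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps(record) + "\n")

# ETL side: parse a collected log file back into change records.
def parse_log_file(path: str) -> list[dict]:
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]
```

One JSON object per line keeps the extra parsing logic mentioned in the cons fairly minimal.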
Pros
- Logging events to a file is easy from a development standpoint on the application server
- A full transaction history can be stored so we don't lose any information
- Minimal performance impact on application server
Cons
- Data reliability is lower than with the other two options because application servers can be torn down at any moment to accommodate scaling, which would lead to log files being lost (servers can also crash)
- Processing log data requires extra parsing logic
- ETL would need to work directly against server instances, which may require service discovery