Reputation: 73
If I have a Magritte ingestion that is set to append, will it detect if rows are deleted in the source data? Will it also delete the rows in the ingested dataset?
Upvotes: 5
Views: 620
Reputation: 787
If this is a file-based ingestion (as opposed to JDBC), Magritte operates on files, not on rows. If the transaction type for the ingestion is set to UPDATE and you make changes to the file, including deleting rows, then the next time the ingestion runs the new file will completely replace the existing file in that dataset, so any changes made in the file will be reflected in the dataset.
Upvotes: 1
Reputation: 1747
For your first question, whether deletions are detected will depend on the database implementation you are extracting from (I'll assume JDBC for this answer). If a deletion shows up as a modification, and therefore as a new row, then yes, your deletes will be ingested.
This would look something like the following at first:
| primary_key | val | update_type | update_ts |
|-------------|-----|-------------|-----------|
| key_1 | 1 | CREATE | 0 |
| key_2 | 2 | CREATE | 0 |
| key_3 | 3 | CREATE | 0 |
Followed by some updates in a subsequent run, incremental on `update_ts`:
| primary_key | val | update_type | update_ts |
|-------------|-----|-------------|-----------|
| key_1 | 1 | UPDATE | 1 |
| key_2 | 2 | UPDATE | 1 |
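To see why the source must cooperate, here is a minimal Python sketch (not Magritte's actual implementation; `incremental_pull` is a hypothetical helper) of what an incremental extract keyed on `update_ts` does:

```python
def incremental_pull(source_rows, last_seen_ts):
    """Simulate an incremental extract keyed on update_ts: only rows
    with a timestamp past the previous run's watermark are ingested.
    A row physically deleted from the source simply never appears in
    the result, which is why the source must write explicit DELETE
    markers for deletions to be detectable downstream."""
    return [r for r in source_rows if r["update_ts"] > last_seen_ts]

source = [
    {"primary_key": "key_1", "val": 1, "update_type": "UPDATE", "update_ts": 1},
    {"primary_key": "key_2", "val": 2, "update_type": "UPDATE", "update_ts": 1},
]

# The previous run's watermark was ts 0, so this run only sees ts > 0.
print(incremental_pull(source, last_seen_ts=0))
```

A hard-deleted row has no `update_ts` at all anymore, so it falls outside every future watermark window.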
Now your database would have to explicitly mark any `DELETE` rows and increment the `update_ts` for the deletion to be brought in:
| primary_key | val | update_type | update_ts |
|-------------|-----|-------------|-----------|
| key_1 | 1 | DELETE | 2 |
After this, you can detect the deleted records and adjust accordingly. Your fully materialized table will now look like the following:
| primary_key | val | update_type | update_ts |
|-------------|-----|-------------|-----------|
| key_1 | 1 | CREATE | 0 |
| key_2 | 2 | CREATE | 0 |
| key_3 | 3 | CREATE | 0 |
| key_1 | 1 | UPDATE | 1 |
| key_2 | 2 | UPDATE | 1 |
| key_1 | 1 | DELETE | 2 |
If you are running your raw ingestion incrementally, these rows will not be automatically deleted from your dataset; you'll have to explicitly write logic in your clean step to detect the deleted records and remove them from your output. If deletes are found, you'll have to SNAPSHOT the output to remove them (unless you're doing lower-level file manipulations where you could perhaps remove the underlying file).
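The detection logic itself can be small. As a hedged sketch (assumed names; your clean step's actual API will differ), the decision is: if the newly ingested batch contains `DELETE` markers, the output must be rewritten as a snapshot; otherwise an incremental append suffices:

```python
def plan_clean_step(new_batch):
    """Hypothetical helper: choose the clean step's write mode for this
    build. If the newly ingested batch carries DELETE markers, the
    output must be fully rewritten (SNAPSHOT) so the removed rows
    disappear; otherwise appending the new rows (APPEND) is enough.
    Returns (mode, keys_to_remove)."""
    deleted = {r["primary_key"] for r in new_batch
               if r["update_type"] == "DELETE"}
    return ("SNAPSHOT", deleted) if deleted else ("APPEND", set())

batch = [{"primary_key": "key_1", "val": 1,
          "update_type": "DELETE", "update_ts": 2}]
print(plan_clean_step(batch))  # → ('SNAPSHOT', {'key_1'})
```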
It's worth noting you'll want to materialize the `DELETE`s as late as possible (assuming your intermediate logic allows for it), since doing so requires a snapshot and will hurt your overall pipeline performance.
If you aren't dealing with JDBC, then @Kellen's answer will apply.
Upvotes: 1