datawizard

Reputation: 73

How does Foundry Magritte append ingestion handle deleted rows in the data source?

If I have a Magritte ingestion that is set to append, will it detect if rows are deleted in the source data? Will it also delete the rows in the ingested dataset?

Upvotes: 5

Views: 620

Answers (2)

Kellen Donohue

Reputation: 787

If this is a file-based ingestion (as opposed to JDBC), Magritte operates on files, not rows. If the transaction type for the ingestion is set to UPDATE and you make changes to the file, including deleting rows, then when the ingestion runs the new file will completely replace the existing file in that dataset, so any changes made in the file will be reflected in the dataset.

Two additional notes:

  1. If you use the "exclude files already synced" filter, you will probably want the last modified date and/or file size options enabled, or the modified file won't be ingested.
  2. If your transaction type is set to APPEND rather than UPDATE, the ingestion will fail because APPEND doesn't allow changes to existing files (see the sketch below).
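
To make the file-level behavior concrete, here is a toy Python sketch of the two transaction types. This is not the Magritte API or its config format, just a conceptual model of why UPDATE replaces a changed file while APPEND fails on it:

```python
# Toy model of file-based transaction semantics -- purely illustrative,
# not the real Magritte API. dataset_files maps file name -> contents.

def apply_transaction(dataset_files: dict, name: str, contents: str, txn_type: str) -> dict:
    if txn_type == "APPEND":
        # APPEND may only add brand-new files; touching an existing file is an error.
        if name in dataset_files:
            raise ValueError(f"APPEND cannot modify existing file: {name}")
        return {**dataset_files, name: contents}
    if txn_type == "UPDATE":
        # UPDATE replaces the existing file wholesale, so rows deleted in the
        # source file disappear from the dataset too.
        return {**dataset_files, name: contents}
    raise ValueError(f"Unknown transaction type: {txn_type}")

dataset = {"orders.csv": "id,val\n1,a\n2,b\n"}
dataset = apply_transaction(dataset, "orders.csv", "id,val\n1,a\n", "UPDATE")  # row 2 is gone
# apply_transaction(dataset, "orders.csv", "id,val\n1,a\n", "APPEND")  # would raise
```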

Upvotes: 1

vanhooser

Reputation: 1747

For your first question, whether deletions are detected depends on the database implementation you are extracting from (I'll assume JDBC for this answer). If a deletion shows up as a modification, and therefore as a new row, then yes, your deletes will show up.

This would look something like the following at first:

| primary_key | val | update_type | update_ts |
|-------------|-----|-------------|-----------|
| key_1       | 1   | CREATE      | 0         |
| key_2       | 2   | CREATE      | 0         |
| key_3       | 3   | CREATE      | 0         |

Followed by some updates in a subsequent run, incremental on update_ts:

| primary_key | val | update_type | update_ts |
|-------------|-----|-------------|-----------|
| key_1       | 1   | UPDATE      | 1         |
| key_2       | 2   | UPDATE      | 1         |

Now your database would have to explicitly mark any DELETE rows and increment the update_ts for this to be brought in:

| primary_key | val | update_type | update_ts |
|-------------|-----|-------------|-----------|
| key_1       | 1   | DELETE      | 2         |
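
A minimal sketch of that soft-delete pattern on the source side, using sqlite3 and the hypothetical table and column names from the example above (your actual database and change-tracking mechanism will differ):

```python
import sqlite3

# Illustrative only: the table name, columns, and timestamps mirror the example above.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE source_table (
        primary_key TEXT,
        val         INTEGER,
        update_type TEXT,
        update_ts   INTEGER
    )
""")
conn.executemany(
    "INSERT INTO source_table VALUES (?, ?, ?, ?)",
    [("key_1", 1, "CREATE", 0), ("key_2", 2, "CREATE", 0), ("key_3", 3, "CREATE", 0)],
)

# Instead of physically deleting key_1, append a DELETE marker row with a newer
# update_ts so an incremental extract can pick it up.
conn.execute("INSERT INTO source_table VALUES ('key_1', 1, 'DELETE', 2)")

# A JDBC-style incremental extract then only pulls rows newer than the last cutoff.
last_seen_ts = 1
rows = conn.execute(
    "SELECT * FROM source_table WHERE update_ts > ?", (last_seen_ts,)
).fetchall()
print(rows)  # [('key_1', 1, 'DELETE', 2)]
```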

After this, you would then be able to detect the deleted records and adjust accordingly. Your full materialized table view will now look like the following:

| primary_key | val | update_type | update_ts |
|-------------|-----|-------------|-----------|
| key_1       | 1   | CREATE      | 0         |
| key_2       | 2   | CREATE      | 0         |
| key_3       | 3   | CREATE      | 0         |
| key_1       | 1   | UPDATE      | 1         |
| key_2       | 2   | UPDATE      | 1         |
| key_1       | 1   | DELETE      | 2         |

If you are running incrementally in your raw ingestion, these rows will not be automatically deleted from your dataset; you'll have to explicitly write logic to detect these deleted records and remove them in your output clean step. If deletes are found, you'll have to SNAPSHOT the output to remove them (unless you're doing lower-level file manipulations where you could perhaps remove the underlying file).
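
As a rough sketch of that clean-step logic in plain PySpark (column names taken from the example above; in practice this would live in your Foundry transform and the output would be written as a SNAPSHOT):

```python
from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.getOrCreate()

# The full materialized history from the example above.
raw = spark.createDataFrame(
    [
        ("key_1", 1, "CREATE", 0), ("key_2", 2, "CREATE", 0), ("key_3", 3, "CREATE", 0),
        ("key_1", 1, "UPDATE", 1), ("key_2", 2, "UPDATE", 1),
        ("key_1", 1, "DELETE", 2),
    ],
    ["primary_key", "val", "update_type", "update_ts"],
)

# Keep only the most recent row per primary key...
latest = Window.partitionBy("primary_key").orderBy(F.col("update_ts").desc())
current_state = (
    raw.withColumn("rn", F.row_number().over(latest))
       .filter(F.col("rn") == 1)
       # ...then drop keys whose latest state is a DELETE marker.
       .filter(F.col("update_type") != "DELETE")
       .drop("rn")
)

current_state.show()  # key_2 and key_3 remain; key_1 has been removed
```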

It's worth noting you'll want to materialize the DELETES as late as possible (assuming your intermediate logic allows for it) since this will require a snapshot and will kill your overall pipeline performance.

If you aren't dealing with JDBC, then @Kellen's answer will apply.

Upvotes: 1
