Reputation: 39
I am building a DWH based on data I am collecting from an ERP API. currently, I am fetching the data from the API based on an incremental mechanism I built using python: The python script fetches all invoices whose last modified date is in the last 24 hours and inserts the data into a "staging table" (no changes are required during this step).
The next step is to insert all data from the staging area into the "final tables". The final tables include primary keys according to the ERP (for example invoice number). There are no primary keys defined at the staging tables. For now, I am putting aside the data manipulation and transformation.
In some cases, it's possible that a specific invoice is already in the "final tables", but then the user updates the invoice at the ERP system which causes the python script to fetch the data again from the API into the staging tables. In the case when I try to insert the invoice into the "final table" I will get a conflict due to the primary key restriction at the "final tables".
Any idea of how to solve this? I am thinking to add a field that details the date and timestamp at which the record land at the staging table ("insert date") and then upsert the records if insert date at the staging table > insert date at the final tables
Is this best practice? Any other suggestions? maybe use a specific tool/data solution? I prefer using python scripts since it is part of a wider project.
Thank you!
Upvotes: 0
Views: 66
Reputation: 89316
Instead of a straight INSERT use an UPSERT pattern. Either the MERGE statement if your database has it, or UPDATE the existing rows, followed by INSERTing the new ones.
Upvotes: 0