Reputation: 1
We have a data warehouse system in which we need to load data that is present on S3 in CSV format into Redshift tables. The only constraint is that only unique records should be inserted into Redshift.
To implement this we are using a staging table in the following manner:
CREATE A TEMPORARY TABLE.
COPY THE S3 FILE INTO THE TEMPORARY TABLE.
BEGIN TRANSACTION
INSERT INTO {main redshift table} select from {join between staging table and main redshift table on a column which should be unique for a record to be unique}
END TRANSACTION
The join used in the SELECT subquery returns those records which are present in the staging table but not in the main Redshift table.
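For illustration, a minimal sketch of this pattern, assuming a hypothetical target table events with a unique column event_id and a placeholder IAM role for the COPY credentials:

CREATE TEMP TABLE staging_events (LIKE events);

COPY staging_events
FROM 's3://my-bucket/incoming/events.csv'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
CSV;

BEGIN TRANSACTION;

-- Anti-join: keep only the staged rows whose event_id is not already in the target table
INSERT INTO events
SELECT s.*
FROM staging_events s
LEFT JOIN events e ON s.event_id = e.event_id
WHERE e.event_id IS NULL;

END TRANSACTION;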
Is the above mechanism free from race conditions?
For example, consider the following:
The main Redshift table has no rows and an S3 file contains two records.
When the same S3 file is loaded by two different processes/requests, the SELECT query for each request reads the main Redshift table as empty, the join returns both rows present in the staging table, and the two rows are inserted twice, resulting in duplicate rows.
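Using the same illustrative names as the sketch above, with the two staged rows having event_id 1 and 2, the scenario described here is the following interleaving of the two sessions:

-- Session A:
BEGIN TRANSACTION;
-- Session B:
BEGIN TRANSACTION;
-- Session A: the anti-join sees events as empty, so both staged rows qualify
INSERT INTO events
SELECT s.* FROM staging_events s
LEFT JOIN events e ON s.event_id = e.event_id
WHERE e.event_id IS NULL;
-- Session B: its read of events also finds no rows, so it inserts the same two rows again
INSERT INTO events
SELECT s.* FROM staging_events s
LEFT JOIN events e ON s.event_id = e.event_id
WHERE e.event_id IS NULL;
-- Session A:
END TRANSACTION;
-- Session B:
END TRANSACTION;
-- events now holds event_id 1 and event_id 2 twice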
Upvotes: 0
Views: 2763
Reputation: 163
Sounds like a potential phantom read scenario. You could avoid this by setting the highest transaction isolation level, SERIALIZABLE.
But that can potentially be quite expensive and lead to deadlocks, so perhaps you would prefer changing your loading pipeline to execute load tasks one by one rather than having multiple load tasks execute on one table in parallel.
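The answer above suggests either SERIALIZABLE isolation or serializing the load tasks in the pipeline. Purely as an illustrative sketch (not part of this answer), one way to force loads to run one at a time at the database level is Redshift's explicit LOCK command inside the load transaction, which makes a second concurrent load wait until the first one commits:

BEGIN TRANSACTION;

-- Blocks here until any other transaction holding a lock on events commits or rolls back
LOCK events;

INSERT INTO events
SELECT s.*
FROM staging_events s
LEFT JOIN events e ON s.event_id = e.event_id
WHERE e.event_id IS NULL;

END TRANSACTION;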
Upvotes: 0
Reputation: 4957
Move processed files to a different S3 location, i.e.:
1) Suppose your app is pushing files to destination s1.
2) Move the file from s1 to your staging location s2 (from this location you populate the Redshift temporary table; see the sketch after these steps).
3) Move the file from s2 to a third location s3.
4) Now do BEGIN TRANSACTION
INSERT INTO {main redshift table} select from {join between staging table and main redshift table on a column which should be unique for a record to be unique}
END TRANSACTION
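As a rough sketch of steps 2 and 4 together, assuming a hypothetical bucket layout where the staging prefix is s3://my-bucket/s2/ and the same illustrative table names as above, the COPY reads only the files currently staged in s2, so files that have already been archived to s3 are never loaded again:

CREATE TEMP TABLE staging_events (LIKE events);

-- Only files currently sitting in the staging prefix (s2) are loaded
COPY staging_events
FROM 's3://my-bucket/s2/'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
CSV;

BEGIN TRANSACTION;
INSERT INTO events
SELECT s.*
FROM staging_events s
LEFT JOIN events e ON s.event_id = e.event_id
WHERE e.event_id IS NULL;
END TRANSACTION;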
Upvotes: 0