Reputation: 101
I have a staging table in PostgreSQL and use a simple INSERT INTO ... SELECT ... query to copy data from the staging table to the final table. However, I occasionally encounter data loss after the operation completes, and I'm not sure why this is happening. I have checked the staging table and it contains the correct data.
Here are some notable details:
The code I am using:
import pg8000
...
# Copy all rows from staging_tbl into final_tbl in batches of 4,000,000
# rows using LIMIT/OFFSET pagination, then commit once at the end.
bigSql = f"INSERT INTO final_tbl({sqlCols}) SELECT {sqlCols} FROM staging_tbl"
print(' ->Inserting data...')
limit = 4000000
offset = 0
while True:
    cursor.execute(bigSql + f" LIMIT {limit} OFFSET {offset}")
    print('Inserted ', cursor.rowcount)
    if cursor.rowcount == 0:
        break
    offset += limit
connection.commit()
Regarding server information: I am using an Azure VM with PostgreSQL provisioned through the Laravel Forge panel.
Upvotes: -7
Views: 62
Reputation: 225164
With this approach, at the very least, you need an ORDER BY clause; otherwise, different invocations of the query aren't guaranteed to return results in the same order, and your LIMIT ... OFFSET ... windows could select some rows twice and miss others entirely.
If you have concurrent write traffic, you also need to handle that.
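For illustration, here is a minimal sketch of the paginated variant with a deterministic order. The connection parameters, the column list, and the unique "id" ordering column are assumptions for the example, not details from your question:

import pg8000

# Sketch only: connection parameters, column list, and the "id" column
# are placeholders assumed for illustration.
connection = pg8000.connect(user="postgres", password="secret",
                            host="localhost", database="mydb")
cursor = connection.cursor()

sqlCols = "id, col_a, col_b"  # assumed column list
bigSql = (
    f"INSERT INTO final_tbl({sqlCols}) "
    f"SELECT {sqlCols} FROM staging_tbl "
    f"ORDER BY id"  # ordering by a unique key makes each LIMIT/OFFSET window deterministic
)

limit = 4000000
offset = 0
while True:
    cursor.execute(bigSql + f" LIMIT {limit} OFFSET {offset}")
    print('Inserted ', cursor.rowcount)
    if cursor.rowcount == 0:
        break
    offset += limit
connection.commit()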
Additionally, on the off chance that you've decided on this pagination solution without having a concrete reason for it, consider using only one query with no LIMIT or OFFSET. Doing a large bulk insert in one transaction isn't necessarily anything to worry about; it won't block concurrent reads or writes to the table, for example. Not only is this more reliable with respect to the issues of inconsistent order and concurrent writes, it should also be faster: unlike if you use a cursor, say, PostgreSQL can't skip directly to an OFFSET and still has to produce all the rows being skipped over, so you're causing it to repeat a lot of work this way. (Granted, you don't get the same progress indications.)
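As a sketch, the single-statement version could look like this, reusing the connection, cursor, and placeholder column list from the example above:

# Sketch only: one statement, one transaction; every staging row is copied
# exactly once, with no dependence on row ordering between runs.
cursor.execute(f"INSERT INTO final_tbl({sqlCols}) SELECT {sqlCols} FROM staging_tbl")
print('Inserted ', cursor.rowcount)
connection.commit()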
Upvotes: 2