Reputation:
I have a large collection of raw data (around 300 million rows), roughly 10% of which is replicated. I need to get the data into a database, and for the sake of performance I'm trying to use SQL copy. The problem is that when I commit the data, primary key exceptions prevent any of it from being processed. Can I change the behavior of primary keys so that conflicting data is simply ignored, or replaced? I don't really care either way - I just need one unique copy of each row.
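Roughly, the semantics I'm after are what the MERGE below would give, but applied during the bulk load itself rather than via a second staged pass (Oracle-style syntax; the table and column names are just placeholders):

    -- Insert rows whose key is new; silently skip rows whose key already
    -- exists in the target (the "ignore" case; "replace" would add a
    -- WHEN MATCHED THEN UPDATE clause).
    MERGE INTO readings t
    USING readings_stage s
       ON (t.id = s.id)
    WHEN NOT MATCHED THEN
        INSERT (id, payload)
        VALUES (s.id, s.payload);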
Upvotes: 1
Views: 570
Reputation: 28837
Use a SELECT statement to select exactly the data you want to insert, without the duplicates.
Use that as the basis of a CREATE TABLE XYZ AS SELECT * FROM (query-just-non-dupes).
You might check out the ASKTOM ideas on how to select the non-duplicate rows.
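A minimal sketch of that approach, assuming Oracle and hypothetical names (a RAW_DATA staging table with key column ID and a PAYLOAD column):

    -- Build a clean copy in one pass, keeping exactly one row per key.
    CREATE TABLE clean_data NOLOGGING AS
    SELECT id, payload
    FROM (
        SELECT r.id,
               r.payload,
               ROW_NUMBER() OVER (PARTITION BY r.id ORDER BY r.ROWID) AS rn
        FROM raw_data r
    )
    WHERE rn = 1;

    -- Add the primary key to the deduplicated copy.
    ALTER TABLE clean_data ADD CONSTRAINT clean_data_pk PRIMARY KEY (id);

Building the clean table in a single pass and constraining it afterwards is usually cheaper than deleting scattered duplicates out of the big table.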
Upvotes: 0
Reputation:
That's what I was considering doing, but I was worried about the performance of getting rid of 30 million randomly placed rows in a 300 million row table. The duplicate data also has a spatial relationship, which is why I wanted to fix the problem while loading the data rather than after it is all loaded.
Upvotes: 0
Reputation: 1997
I think your best bet would be to drop the constraint, load the data, then clean up the duplicates and reapply the constraint.
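Something along these lines, assuming Oracle and hypothetical names (READINGS table, READINGS_PK constraint, ID key column):

    -- Disable the primary key so the bulk load cannot fail on duplicates.
    ALTER TABLE readings DISABLE CONSTRAINT readings_pk;

    -- ... run the bulk load of all ~300 million rows here ...

    -- Clean up: keep one row per key and delete the rest.
    DELETE FROM readings r
    WHERE r.ROWID NOT IN (
        SELECT MIN(r2.ROWID)
        FROM readings r2
        GROUP BY r2.id
    );

    -- Reapply the constraint once the data is unique again.
    ALTER TABLE readings ENABLE CONSTRAINT readings_pk;

Note that enabling the constraint recreates the primary key index across all 300 million rows, so expect that last step to take a while.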
Upvotes: 2