Reputation: 570
I'm working on maintaining a database which stores data transfer information between different networks. Essentially, each data transfer is logged and at the end of each month I run a perl script that loads the log files into a table in the database. I did not design the perl script or the database schema. It was done before I started working on this project.
I used this link to retrieve the primary keys of the table (usage_detail is the name of the table) and it gave me nothing. Since there are so many records in the table, it's not very easy to keep track of duplicates. We've had problems where a lot of duplicates were loaded (because of bugs in the script that does the logging for each transfer, but that's another topic) and ended up having to drop the latest load and reload all the new records after fixing the log files. As you may have guessed, this is stupid and tedious.
To fix this, I would like to add a primary key to the table. For several reasons, we don't want to add an entirely new column for the primary key. After looking at the fields, I've come up with a multi-column primary key. It consists of: transfer start timestamp, transfer end timestamp, and the name of the file transferred (which also includes the entire path). It seems highly unlikely that two records would have all of those fields the same.
Here are my questions: 1) If I add this primary key in the table, what would happen to any duplicates that might already be present in the table?
2) How would I actually add this primary key to the table? (We are using PostgreSQL 8.1.22.)
3) After the primary key is added, let's say the load script tries to load a duplicate while it is running. What sort of error would PostgreSQL throw? Would I be able to catch it in the script?
4) I know you don't have much information about the load script, but given the information I have provided, do you foresee anything that might need to be changed in the script?
Any help is greatly appreciated. Thanks.
Upvotes: 3
Views: 4522
Reputation: 658987
Your plan is to add a needlessly huge index for 40 million (!) rows. And you aren't even sure it's going to be unique. I would strongly advise against that course of action. Add a serial column instead and be done with it:
ALTER TABLE tbl ADD COLUMN tbl_id serial PRIMARY KEY;
That's all you need to do. The rest happens automatically. More in the manual or in these closely related answers:
PostgreSQL primary key auto increment crashes in C++
Auto increment SQL function
Adding a serial column is a one-time operation, but expensive. The whole table has to be rewritten, blocking updates for the duration of the operation. Best done without concurrent load at off hours. I quote the manual here:
Adding a column with a non-null default or changing the type of an existing column will require the entire table and indexes to be rewritten. [...] Table and/or index rebuilds may take a significant amount of time for a large table; and will temporarily require as much as double the disk space.
Since this effectively rewrites the whole table, you might as well create a new table with a serial PK column, insert all rows from the old table (letting the serial column fill with default values from its sequence), drop the old table and rename the new one; see the sketch after these links. More in these closely related answers:
Updating database rows without locking the table in PostgreSQL 9.2
Add new column without table lock?
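Here is a minimal sketch of that new-table approach, assuming a simplified table with columns col1, col2, col3 (the names are placeholders, not your actual schema):
BEGIN;
CREATE TABLE usage_detail_new (
   usage_detail_id serial PRIMARY KEY  -- new surrogate key
 , col1 timestamp
 , col2 timestamp
 , col3 text
);
INSERT INTO usage_detail_new (col1, col2, col3)
SELECT col1, col2, col3
FROM   usage_detail;                   -- serial column fills from its sequence
DROP TABLE usage_detail;
ALTER TABLE usage_detail_new RENAME TO usage_detail;
COMMIT;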
Make sure all your INSERT statements have a target list, so that an additional column can't confuse them:
INSERT INTO tbl (col1, col2, ...) VALUES ...
Not:
INSERT INTO tbl VALUES ...
A serial is implemented with an integer column (4 bytes).
A primary key constraint is implemented with a unique index and a NOT NULL constraint on the involved columns.
The contents of an index are stored much like tables. Additional physical storage is needed separately. More about physical storage in this related answer:
Calculating and saving space in PostgreSQL
Your index would include 2 timestamps (2 x 8 bytes) plus a lengthy filename incl. path (~ 50 bytes?). That would make the index around 2.5 GB bigger (40M x 60-something bytes) and all operations slower.
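For reference, the multi-column primary key you had in mind would look roughly like this; the column names (transfer_start, transfer_end, filename) are guesses based on your description:
ALTER TABLE usage_detail
   ADD CONSTRAINT usage_detail_pkey
   PRIMARY KEY (transfer_start, transfer_end, filename);
That single statement also builds the wide unique index discussed above, which is exactly the cost I am advising against.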
How to deal with "importing duplicates" depends on how you are importing data and how "duplicate" is defined exactly.
If we are talking about COPY statements, one way would be to use a temporary staging table and collapse duplicates with a simple SELECT DISTINCT or DISTINCT ON in the INSERT command:
CREATE TEMP TABLE tbl_tmp AS
SELECT * FROM tbl LIMIT 0; -- copy structure without data and constraints
COPY tbl_tmp FROM '/path/to/file.csv';
INSERT INTO tbl (col1, col2, col3)
SELECT DISTINCT ON (col1, col2)
col1, col2, col3 FROM tbl_tmp;
Or, to also prohibit duplicates with already existing rows:
INSERT INTO tbl (col1, col2, col3)
SELECT i.*
FROM (
SELECT DISTINCT ON (col1, col2)
col1, col2, col3
FROM tbl_tmp
) i
LEFT JOIN tbl t USING (col1, col2)
WHERE t.col1 IS NULL;
The temporary table is dropped automatically at the end of the session.
But the proper fix would be to deal with the root of the error that produces duplicates in the first place.
1) You could not add the PK at all if there is even a single duplicate over all of those columns.
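To check whether that is the case, a quick sketch of a duplicate check, again assuming the candidate columns are named transfer_start, transfer_end and filename:
SELECT transfer_start, transfer_end, filename, count(*) AS dupes
FROM   usage_detail
GROUP  BY transfer_start, transfer_end, filename
HAVING count(*) > 1;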
2) I would only touch a PostgreSQL database version 8.1 with a five-foot pole. It's hopelessly ancient, outdated and inefficient, not supported any more and probably has a number of unfixed security holes. Official Postgres versioning site.
@David already supplied the SQL statement.
3 & 4) A duplicate key violation. PostgreSQL throwing an error also means the whole transaction is rolled back. Catching the error in a Perl script cannot make the rest of the transaction go through. You would have to use a server-side function, with PL/pgSQL for instance, where you can catch exceptions.
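A rough sketch of such a function, with hypothetical column names (transfer_start, transfer_end, filename); it silently skips a row that would violate the key instead of aborting the whole transaction:
CREATE OR REPLACE FUNCTION insert_usage_detail(_start timestamp, _end timestamp, _file text)
  RETURNS void AS
$func$
BEGIN
   INSERT INTO usage_detail (transfer_start, transfer_end, filename)
   VALUES (_start, _end, _file);
EXCEPTION WHEN unique_violation THEN
   NULL;   -- duplicate row: do nothing, the rest of the transaction continues
END
$func$ LANGUAGE plpgsql;
Note that in 8.1 the plpgsql language may first have to be enabled with createlang.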
Upvotes: 3
Reputation: 13911
Note: you are on an unsupported version of PostgreSQL (you should probably upgrade). I couldn't even test this on SQL Fiddle, so all answers are given based on PostgreSQL 9.1.
Upvotes: 2