Reputation: 5383
I have a Postgres database, and I have inserted some data into a table. Because of issues with the internet connection, some of the data couldn't be written. The file that I am trying to write into the database is large (about 330712484 rows - even the wc -l
command takes a while to complete).
Now, the column row_id
is the (integer) primary key, and is already indexed. Since some of the rows could not be inserted into the table, I wanted to insert these specific rows. (I estimate only about 1.8% of the data isn't in the table ...) As a start, I tried to check whether the primary keys were already in the database, like so:
import csv
import psycopg2

conn = psycopg2.connect(connector)
cur = conn.cursor()
with open(fileName) as f:
    header = f.readline().strip()
    header = list(csv.reader([header]))[0]
    print(header)
    for i, l in enumerate(f):
        if i > 10: break
        print(l.strip())
        row_id = l.split(',')[0]
        query = 'select * from raw_data.chartevents where row_id={}'.format(row_id)
        cur.execute(query)
        print(cur.fetchall())
cur.close()
conn.close()
Even for the first few rows of data, checking whether the primary key exists takes a very long time.
What would be the fastest way of doing this?
Upvotes: 1
Views: 863
Reputation: 266
The fastest way to insert data into PostgreSQL is using the COPY protocol, which is implemented in psycopg2. COPY will not allow you to check whether the target id already exists, though. The best option is to COPY your file's contents into a temporary table, then INSERT or UPDATE from there, as in the Batch Update article I wrote on my http://tapoueh.org blog a while ago.
With a recent enough version of PostgreSQL you may use
INSERT INTO ...
SELECT * FROM copy_target_table
ON CONFLICT (pkey_name) DO NOTHING
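A minimal sketch of that flow with psycopg2, assuming the target table raw_data.chartevents and the row_id primary key from the question; the staging table name chartevents_stage and the CSV options are illustrative:

import psycopg2

conn = psycopg2.connect(connector)  # same connection string as in the question
cur = conn.cursor()

# 1. Stage the whole file into a temporary table with the same structure.
cur.execute("""
    CREATE TEMP TABLE chartevents_stage
    (LIKE raw_data.chartevents INCLUDING DEFAULTS)
""")
with open(fileName) as f:
    # COPY ... FROM STDIN streams the file over the existing connection.
    cur.copy_expert(
        "COPY chartevents_stage FROM STDIN WITH (FORMAT csv, HEADER true)", f
    )

# 2. Insert only the rows whose primary key is not already present.
cur.execute("""
    INSERT INTO raw_data.chartevents
    SELECT * FROM chartevents_stage
    ON CONFLICT (row_id) DO NOTHING
""")

conn.commit()
cur.close()
conn.close()

This way there is a single existence check per row, done inside the server as part of the INSERT, instead of one round-trip SELECT per row.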
Upvotes: 2
Reputation: 2081
Can I offer a workaround?
The index will be checked for each row inserted, and Postgres performs the whole insert in a single transaction, so you are effectively storing all this data to disk before it is written.
Could I suggest you drop the indexes to avoid this slowdown, then split the file into smaller files using head -n [int] > newfile
or something similar, then perform the COPY commands separately for each one.
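A rough sketch of that idea in Python, assuming the primary-key constraint is named chartevents_pkey (check with \d raw_data.chartevents), that the rows being loaded are not already in the table, and with an illustrative chunk size and commit-per-chunk pattern:

import io
import psycopg2
from itertools import islice

CHUNK = 1_000_000  # illustrative chunk size

conn = psycopg2.connect(connector)
cur = conn.cursor()

# Dropping the primary key (and its index) avoids the per-row index check;
# it is re-added once the data is loaded.
cur.execute("ALTER TABLE raw_data.chartevents DROP CONSTRAINT chartevents_pkey")
conn.commit()

with open(fileName) as f:
    f.readline()  # skip the header line; COPY below expects raw data rows
    while True:
        chunk = list(islice(f, CHUNK))
        if not chunk:
            break
        buf = io.StringIO(''.join(chunk))
        cur.copy_expert(
            "COPY raw_data.chartevents FROM STDIN WITH (FORMAT csv)", buf
        )
        conn.commit()  # each chunk is committed separately

cur.execute("ALTER TABLE raw_data.chartevents ADD PRIMARY KEY (row_id)")
conn.commit()
cur.close()
conn.close()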
Upvotes: 0