Reputation: 5373
I am writing a program to load data into a particular database. This is what I am doing right now ...
conn = psycopg2.connect("dbname='%s' user='postgres' host='localhost'" % dbName)
cur = conn.cursor()

lRows = len(rows)
i, iN = 0, 1000
while True:
    if iN >= lRows:
        # write the last of the data, and break ...
        iN = lRows
        values = [dict(zip(header, r)) for r in rows[i:iN]]
        cur.executemany(insertString, values)
        conn.commit()
        break

    values = [dict(zip(header, r)) for r in rows[i:iN]]
    cur.executemany(insertString, values)
    conn.commit()

    i += 1000
    iN += 1000

cur.close()
conn.close()
I am aware of this question about the use of the COPY command. However, I need to do some bookkeeping on my files before I can upload them into the database. Hence I am using Python in this manner.
I have a couple of questions on how to make things faster ...

1. Is it better to use multiple cur.executemany() statements and a single conn.commit() at the end? This means that I would put a single conn.commit() statement just before the cur.close() statement.
2. I have always seen cur.executemany() used for batches of 1000 or so records. Is this generally the case, or is it possible to just do a cur.executemany() on the entire set of records that I read from the file? I would potentially have hundreds of thousands of records, or maybe a little over a million. (I have sufficient RAM to fit the entire file in memory.) How do I know the upper limit on the number of records that I can upload at any one time?

Thanks very much for any help that I can get. Sorry for the questions being so basic. I am just starting with databases in Python, and for some reason I don't seem to have a definitive answer to any of these questions right now.
Upvotes: 2
Views: 773
Reputation: 1356
You mention that you are worried the database connection might break. If you use a single conn.commit() only after all inserts, you can easily lose already inserted but not yet committed data if your connection breaks before that conn.commit(). If you do conn.commit() after each cur.executemany(), you won't lose everything, only the last batch. So it's up to you and depends on the workflow you need to support.
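For reference, a minimal sketch of the commit-per-batch variant, reusing the conn, cur, rows, header and insertString names from your code (so a sketch, not a drop-in implementation):

batch_size = 1000
for start in range(0, len(rows), batch_size):
    # build the parameter dicts for this slice only
    values = [dict(zip(header, r)) for r in rows[start:start + batch_size]]
    cur.executemany(insertString, values)
    conn.commit()  # once this returns, the batch survives even if the connection later breaks

# the single-commit variant is the same loop with conn.commit() moved after the loop:
# nothing becomes visible or durable until every batch has been executed.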
The number of records per batch is a trade-off between insertion speed and other concerns. You need to choose a value that satisfies your requirements; you can test your script with 1000 records per batch and with 10000 per batch and check the difference.
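If you want to measure it, something like the rough sketch below works (load_with_batch_size is a hypothetical helper reusing your variable names; point it at a scratch or truncated table so repeated runs don't pile up duplicate rows):

import time

def load_with_batch_size(conn, rows, header, insertString, batch_size):
    """Insert every row in batches of batch_size and return the elapsed seconds."""
    cur = conn.cursor()
    started = time.perf_counter()
    for start in range(0, len(rows), batch_size):
        values = [dict(zip(header, r)) for r in rows[start:start + batch_size]]
        cur.executemany(insertString, values)
    conn.commit()
    cur.close()
    return time.perf_counter() - started

for batch_size in (1000, 10000, 100000):
    print(batch_size, load_with_batch_size(conn, rows, header, insertString, batch_size))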
Inserting the whole file within one cur.executemany() has the advantage of atomicity: if it has executed successfully, all records from this particular file have been inserted, so we're back to point 1.
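In code that would look roughly like this (again reusing the names from your question; the explicit rollback just makes the all-or-nothing behaviour visible):

values = [dict(zip(header, r)) for r in rows]  # everything from one file
try:
    cur.executemany(insertString, values)
    conn.commit()    # all rows from the file become visible together
except Exception:
    conn.rollback()  # nothing from this file was persisted
    raise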
I think the cost of establishing a new connection in your case does not really matter. Say it takes one second to establish a new connection; with 1000 files that is 1000 seconds spent on connections over a job that runs for days.
The program itself looks fine, but I would still recommend you take a look at the COPY ... FROM command with UNLOGGED or TEMPORARY tables; it will really speed up your imports.
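A rough sketch of that approach with psycopg2, reusing dbName, rows and header from your code. The table names (staging, target_table) are made up, and the CSV building is naive (it assumes your cleaned values contain no commas or newlines):

import io
import psycopg2

conn = psycopg2.connect("dbname='%s' user='postgres' host='localhost'" % dbName)
cur = conn.cursor()

# UNLOGGED skips write-ahead logging, so the load is much faster,
# but the staged data is not crash-safe until it is moved to the real table
cur.execute("CREATE UNLOGGED TABLE IF NOT EXISTS staging (LIKE target_table INCLUDING DEFAULTS)")

# stream the already-cleaned rows to the server as CSV via COPY ... FROM STDIN
buf = io.StringIO()
for r in rows:
    buf.write(",".join(str(v) for v in r) + "\n")
buf.seek(0)
cur.copy_expert("COPY staging (%s) FROM STDIN WITH (FORMAT csv)" % ", ".join(header), buf)

# move everything into the real table in one statement, then clean up
cur.execute("INSERT INTO target_table SELECT * FROM staging")
cur.execute("DROP TABLE staging")
conn.commit()
cur.close()
conn.close()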
Upvotes: 1