ssm

Reputation: 5373

speed improvement in postgres INSERT command

I am writing a program to load data into a particular database. This is what I am doing right now ...

        import psycopg2

        conn = psycopg2.connect("dbname='%s' user='postgres' host='localhost'" % dbName)
        cur  = conn.cursor()

        lRows = len(rows)
        i, iN = 0, 1000          # current batch is rows[i:iN]
        while True:

            if iN >= lRows:
                # write the last (possibly partial) batch of data, and break ...
                iN = lRows
                values = [dict(zip(header, r)) for r in rows[i:iN]]
                cur.executemany( insertString, values )
                conn.commit()
                break

            # insert the next full batch of 1000 rows and commit it
            values = [dict(zip(header, r)) for r in rows[i:iN]]
            cur.executemany( insertString, values )
            conn.commit()

            i  += 1000
            iN += 1000

        cur.close()
        conn.close()

I am aware of this question about the use of the COPY command. However, I need to do some bookkeeping on my files before I can upload them into the database, which is why I am using Python in this manner.

I have a couple of questions on how to make things faster ...

  1. Would it be better (or possible) to do many cur.executemany() statements and a single conn.commit() at the end? That is, I would put a single conn.commit() statement just before the cur.close() statement (a sketch of this is shown after this list).
  2. I have always seen other people use cur.executemany() for batches of 1000 or so records. Is this generally the case, or is it possible to just do a cur.executemany() on the entire set of records that I read from the file? I could potentially have hundreds of thousands of records, or maybe a little over a million. (I have sufficient RAM to fit the entire file in memory.) How do I know the upper limit on the number of records that I can upload at any one time?
  3. I am making a fresh connection to the database for every file that I open. I am doing this because the process takes many days to complete, and I don't want connection issues to corrupt the entire data set if the connection is lost at any time. I have over a thousand files to go through. Are these thousand connections going to be a significant part of the time used for the process?
  4. Are there any other things that I am doing in the program that I shouldn't be, and whose removal could shorten the total time for the process?
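
For reference, here is a minimal sketch of what question 1 describes, assuming the same rows, header, insertString and dbName variables as in the code above:

    # Sketch of question 1: batch the inserts as before, but commit only once
    # at the end, just before closing the cursor and connection.
    import psycopg2

    conn = psycopg2.connect("dbname='%s' user='postgres' host='localhost'" % dbName)
    cur = conn.cursor()

    batchSize = 1000
    for i in range(0, len(rows), batchSize):
        values = [dict(zip(header, r)) for r in rows[i:i + batchSize]]
        cur.executemany(insertString, values)

    conn.commit()   # single commit for the whole file
    cur.close()
    conn.close()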

Thanks very much for any help that I can get. Sorry for the questions being so basic. I am just starting with databases in Python, and for some reason I don't seem to have a definitive answer to any of these questions right now.

Upvotes: 2

Views: 773

Answers (1)

icuken

Reputation: 1356

  1. As you mentioned in point 3, you are worried about the database connection breaking, so if you use only one conn.commit() after all the inserts, you can easily lose data that has been inserted but not yet committed if the connection breaks before that conn.commit(). If you do conn.commit() after each cur.executemany(), you won't lose everything, only the last batch. So it's up to you and depends on the workflow you need to support.

  2. The number of records per batch is a trade-off between insertion speed and other things. You need to choose a value that satisfies your requirements; test your script with 1000 records per batch and with 10000 per batch and measure the difference. Inserting a whole file within one cur.executemany() has the advantage of atomicity: if it has executed, then all records from that particular file have been inserted, so we're back to point 1.

  3. I think the cost of establishing a new connection does not really matter in your case. Say it takes one second to establish a new connection; with 1000 files that is only about 1000 seconds spent on connections over a process that takes days.

  4. The program itself looks fine, but I would still recommend taking a look at the COPY FROM command, possibly combined with UNLOGGED or TEMPORARY tables; it will really speed up your imports (see the sketch below).
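
As an illustration of point 4, here is a rough sketch of how the already-cleaned rows could be streamed through COPY FROM using psycopg2's copy_from(). The table name my_table is only a placeholder, and it assumes your row values need no further escaping or NULL handling:

    # Rough sketch of point 4: stream the rows through COPY FROM instead of
    # INSERT. `my_table` is a placeholder; adjust escaping/NULL handling to
    # your data before relying on this.
    import io
    import psycopg2

    conn = psycopg2.connect("dbname='%s' user='postgres' host='localhost'" % dbName)
    cur = conn.cursor()

    # Build an in-memory, tab-separated "file" from the rows that were
    # already bookkept in Python.
    buf = io.StringIO()
    for r in rows:
        buf.write('\t'.join(str(v) for v in r) + '\n')
    buf.seek(0)

    cur.copy_from(buf, 'my_table', sep='\t', columns=header)
    conn.commit()
    cur.close()
    conn.close()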

Upvotes: 1
