Dylan
Dylan

Reputation: 1678

Run Oracle update statements in a batch mode

I need to run a couple of relatively simple SQL update statements to update a single column in an Oracle table with 14.4 million rows. One statement runs a function written in Java and the JVM runs out of memory as I’m doing the update on all 14.4 million rows.

Have you written a kind of batch PL/SQL routine that can break this simple update into sets of, say, 10K records per batch? I know that if I can commit my updates after a bunch of records, it’ll go a lot faster and I won’t run out of memory. I’m sure there’s a simple way to do this using a FOR loop and row_num but I'm not making much progress.

Here are the two statements I need to run for each batch of n records:

first one:

 update vr_location l set l.usps_address=(
   select mylib.string_utils.remove_duplicate_whitespace(
   house_number || ' ' || pre_street_direction || ' ' || street_name || ' ' || 
   street_description || ' ' || post_street_direction)
 from vr_address a where a.address_pk=l.address_pk);

second:

update vr_location set usps_address = mylib.usaddress_utils.parse_address(usps_address);

Upvotes: 0

Views: 6752

Answers (3)

Dylan
Dylan

Reputation: 1678

Well, I had to get stuff done so I took your recommendations then did a little Python to do it. I ended up using cx_Oracle to give me good control over the transactions. Obviously PL/SQL would have been better but I don't know it. Python is my new hammer, and everything is a nail!

#!/usr/bin/env python
import csv
import time
import cx_Oracle

# Parses USPS addresses from voter addresses
# and inserts them into VR_LOCATION table ready
# for geocoding. Does batches by zipcode
def LoadZips():
    zipcodes = []
    zips = open('OH_ZIP_CODES.txt','r')
    for line in zips:
        zip = line[0:5]
        if zip not in zipcodes:
            zipcodes.append(zip)
    zips.close()
    return zipcodes

def UpdateAddresses(ziplist):
    counter = 1
    total = len(ziplist)

    for zipcode in ziplist:
        orcl = cx_Oracle.connect('voter/voter@oracle')
        curs = orcl.cursor()
        countsql = "select count(*) from vr_location where zip_co = '%s'" % zipcode
        concatsql = """update vr_location l set l.usps_address=(
                    select mizar.string_utils.remove_duplicate_whitespace(
                        house_number
                        ||' '||pre_street_direction
                        ||' '||street_name
                        ||' '||street_description
                        ||' '||post_street_direction)
                    from vr_address a where a.address_pk = l.address_pk)
                where zip_co = '%s'""" % zipcode
        parsesql = """update vr_location set usps_address =  mizar.usaddress_utils.parse_address(usps_address)
                where zip_co = '%s'""" % zipcode
        curs.execute(countsql)

        records_affected = curs.fetchone()[0]
    if records_affected == 0:
        print "No records for zipcode %s" % zipcode
        counter += 1 
        continue

    print "[%s] %s of %s: %s addresses" % (zipcode, counter, total, records_affected)
    curs.execute(concatsql)
    orcl.commit()
    curs.execute(parsesql)
    orcl.commit()
    curs.close()
    counter += 1 

    # Uncomment this to debug - just steps through X zipcodes      
    #if counter == 3:
    #    print "Cleaning up..."
    #    break


if __name__ == "__main__":
    start = time.clock()
    zipcodes = LoadZips()
    print "Processing addresses in %s zip codes" % len(zipcodes)
    UpdateAddresses(zipcodes)

Blog post on the topic

Upvotes: 0

dpbradley
dpbradley

Reputation: 11915

You (or your DBA) should size the UNDO properly and do this as a single SQL transaction

The advantages are:

  • read consistency on table while this is happening
  • you retain the ability to rollback the transaction in case something fails

If you're in some sort of loading environment where you don't care about either of these, then use CTAS (create table as select) to make a new table with the modified value, build the indexes, constraints, etc and then swap the table names. 14 million rows isn't that big these days.

Upvotes: 1

stili
stili

Reputation: 674

Do an initial select to retrieve some kind of grouping attribute, so that you end up with groups that have the desired number of rows. Experiment with the grouping clause, for instance the last three digits of a zip-code or something semi random.

Loop over the grouping clause, using the clause as parameter to limit the rows targeted by each update statement. commit at the end of each iteration.

Upvotes: 1

Related Questions