S.Keizer

Reputation: 53

PYODBC Insert Statement in MS Access DB Extremely slow

I am looking to speed up my insert statement into an Access DB. The data is only 86,500 records but is taking more than 24 hours to process. The part of the code I am looking to speed up is comparing two tables for duplicates; if no duplicate is found, that row is inserted. I am running 64-bit Windows 10, 32-bit Python 2.7, the 32-bit MS Access ODBC driver and the 32-bit pyodbc module. Any help would be greatly appreciated. The code sample is below.

import os
import pyodbc

def importDIDsACC():
    """Compare the Ledger to ImportDids to find any missing records"""
    imdidsLst = []
    ldgrLst = readMSAccess("ActivityNumber", "Ledger")
    for x in readMSAccess("DISP_NUM", "ImportDids"):
        if x not in ldgrLst and x not in imdidsLst:
            imdidsLst.append(x)
    #Select the records to import
    if len(imdidsLst) > 0:
        sql = ""
        for row in imdidsLst:
            sql += "DISP_NUM = '" + row[0]
            cursor.execute("SELECT * FROM ImportDids WHERE " + sql)
            rows = cursor.fetchall()
            #Import to Ledger
            dupChk = []
            for row in rows:
                if row[4] not in dupChk:
                    cursor.execute('INSERT into Ledger ([ActivityNumber], [WorkArea], [ClientName], [SurfacePurpose], [OpsApsDist], [AppDate], [LOADate], [EffDate], [AmnDate], [CanDate], [RenDate], [ExpDate], [ReiDate], [AmlDate], [DispType], [TRM], [Section], [Quarter], [Inspected_Date], [Inspection_Reason], [Inspected_By], [InspectionStatus], [REGION], [DOC], [STATCD]) VALUES(?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)',
                                   str(row[1]), str(row[18]), str(row[17]), row[14], str(row[26]), row[4], row[5], row[6], row[7], row[8], row[9], row[10], row[11], row[12], str(row[1][0:3]), trmCal(str(row[21]),str(row[20]), str(row[19])), str(row[22]), str(row[23]), inspSts(str(row[1]), 0),inspSts(str(row[1]), 1), inspSts(str(row[1]), 2), inspSts(str(row[1]), 3), str(row[27]), str(row[3]), str(row[13]))
                    dupChk.append(row[4])
            cnxn.commit()

def readMSAccess(columns, table):
    """Select all records from the chosen field"""
    sql = "SELECT "+ columns +  " FROM " + table
    cursor.execute(sql)
    rows = cursor.fetchall()
    return rows

def dbConn():
    """Connects to Access dataBase"""
    connStr = """
    DRIVER={Microsoft Access Driver (*.mdb, *.accdb)};
    DBQ=""" + getDatabasepath() + ";"
    cnxn = pyodbc.connect(connStr)
    cursor = cnxn.cursor()
    return cursor, cnxn

def getDatabasepath():
    """get the path to the access database"""
    mycwd = os.getcwd()
    os.chdir("..")
    dataBasePath = os.getcwd() + os.sep + "LandsAccessTool.accdb"
    os.chdir(mycwd)
    return dataBasePath

# Connect to the Access Database
cursor, cnxn = dbConn()

# Update the Ledger with any new records from importDids
importDIDsACC()

Upvotes: 0

Views: 453

Answers (1)

C Perkins

Reputation: 3886

Don't use external code to check for duplicates. The power of a database (even Access) is maximizing its data-set operations. Don't try to rewrite that kind of code, especially since, as you've discovered, it is not efficient. Instead, import everything into a temporary database table, then use Access (or the appropriate Access Database Engine) to execute SQL statements that compare the tables, either finding or excluding duplicate rows. The results of those queries can then be used to create and/or update other tables, all within the context of the database engine. Of course, set up the temporary table(s) with appropriate indexes and keys to maximize efficiency.
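
A rough sketch of that idea using the tables already in the question's database (ImportDids and Ledger, matching DISP_NUM against ActivityNumber). The two-column INSERT is only illustrative; the question's full 25-column mapping, including the values it derives in Python (trmCal, inspSts), would have to be staged into the import table beforehand for a pure-SQL statement like this to cover them.

# Sketch: let the Access engine find and insert the non-duplicates in one
# set-based statement instead of looping over candidate rows in Python.
sql = """
    INSERT INTO Ledger ([ActivityNumber], [DispType])
    SELECT i.DISP_NUM, LEFT(i.DISP_NUM, 3)
    FROM ImportDids AS i
    LEFT JOIN Ledger AS l ON i.DISP_NUM = l.ActivityNumber
    WHERE l.ActivityNumber IS NULL
    """
cursor.execute(sql)
cnxn.commit()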

In the meantime, it is usually faster (am I allowed to say always?) when comparing data sets locally (i.e. tables) to load all the values into some searchable collection with a single database request (i.e. one SQL SELECT statement), then use that in-memory collection to search for matches. This may seem ironic after my last statement about maximizing the database's capabilities, but the big idea is understanding how the data set as a whole is being processed. Transporting the data back and forth between the Python process and the database engine, even on the same machine, will be much slower than processing everything within either the Python process or the database engine process. The only time that might not be useful is when the remote data set is much too large to download, but 87,000 key values is definitely small enough to load into a Python collection.
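
And a sketch of that second approach, reusing the question's readMSAccess, cursor and cnxn; the column position of DISP_NUM in the full ImportDids row is an assumption taken from the question's INSERT.

# Sketch: one round trip to fetch the existing Ledger keys, then test
# membership against an in-memory set (constant-time lookups) instead of
# issuing a query per candidate record.
existing = set()
for row in readMSAccess("ActivityNumber", "Ledger"):
    existing.add(str(row[0]))

cursor.execute("SELECT * FROM ImportDids")
for row in cursor.fetchall():
    disp_num = str(row[1])  # assumed position of DISP_NUM, as in the question's INSERT
    if disp_num not in existing:
        # The question's full 25-column INSERT goes here; only the key
        # column is shown in this sketch.
        cursor.execute("INSERT INTO Ledger ([ActivityNumber]) VALUES (?)", disp_num)
        existing.add(disp_num)  # also guards against duplicates within ImportDids
cnxn.commit()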

Upvotes: 1
