Christopher

Reputation: 125

Speeding up SQLite3 UPDATE function using CSV file

Due to the limitations of SQLite 3.7.13 when updating columns through a JOIN, I created the following Python script to work around the problem. However, because of the amount of data I am working with, I am running into system resource issues, and the update method is taking too long.

I have a SQLite3 table that contains 7,159,587 records with the following schema:

dist_trpwrenc (id integer primary key autoincrement, IP TEXT, VNE_INTERNAL TEXT, VNE_ENTERPRISE TEXT, VNE_EXTERNAL TEXT)

I have a CSV file that contains 9,224,812 records, including duplicates. Here is an example of the data in the CSV file:

"IP","VNE"
"192.168.1.1","internal"
"192.168.1.1","enterprise"
"192.168.1.1","external"
"192.168.2.1","internal"
"192.168.2.1","external"

The Python script is taking the CSV file and updating the "dist_trpwrenc" table per the example below:

--------------------------------------------------------------
|      IP     | VNE_INTERNAL | VNE_ENTERPRISE | VNE_EXTERNAL |
--------------------------------------------------------------
| 192.168.1.1 |      x       |       x        |      x       |
| 192.168.2.1 |      x       |                |      x       |
--------------------------------------------------------------

I am looking for a faster method to handle the updates. Is that possible with SQLite3/Python?

#!/usr/bin/python

import sys, csv, sqlite3, logging, time, os, errno

s = time.strftime('%Y%m%d%H%M%S')

# Create exception file from standard output
class Logger(object):
    def __init__(self):
        self.terminal = sys.stdout
        self.log = open((s)+"_log", "a")

    def write(self, message):
        self.terminal.write(message)
        self.log.write(message)

def dist_trpwrenc_update():
    sys.stdout = Logger()

    con = sqlite3.connect(sys.argv[1]) # input database name (e.g. database.db) and creates in current working directory.
    cur = con.cursor()

    try:
        with open(sys.argv[2], "rb") as f: # input CSV file
            reader = csv.reader(f, delimiter=',')
            next(reader) # skip the "IP","VNE" header row
            for row in reader:
                try:
                    ip, vne = row[0], row[1]
                    ipupdate = (ip,)

                    if vne == 'internal':
                        cur.execute("UPDATE dist_trpwrenc SET VNE_INTERNAL='x' WHERE IP=?;", ipupdate)
                        con.commit()
                        print ip, 'updated:', vne, 'successfully!'
                    elif vne == 'enterprise':
                        cur.execute("UPDATE dist_trpwrenc SET VNE_ENTERPRISE='x' WHERE IP=?;", ipupdate)
                        con.commit()
                        print ip, 'updated:', vne, 'successfully!'
                    elif vne == 'external':
                        cur.execute("UPDATE dist_trpwrenc SET VNE_EXTERNAL='x' WHERE IP=?;", ipupdate)
                        con.commit()
                        print ip, 'updated:', vne, 'successfully!'
                    else:
                        print ip, 'did not update:', vne
                except (KeyboardInterrupt, SystemExit):
                    raise
                except (KeyboardInterrupt, SystemExit):
                    raise
    except IOError:
        raise

    # Close SQLite database connection  
    con.close()

    # Stop logging
    sys.stdout = sys.__stdout__

def main():
    dist_trpwrenc_update()

if __name__=='__main__':
    main()

Thanks for all the tips. I went with another method instead, just using a SQL CASE statement:

#!/usr/bin/python
# -*- coding: utf-8 -*-

import sys, csv, sqlite3, logging, time, os, errno

# Functions
s = time.strftime('%Y%m%d%H%M%S')

# Create file from standard output for database import
class Logger(object):
    def __init__(self):
        self.terminal = sys.stdout
        self.log = open((s) + "_" + sys.argv[1], "a")        

    def write(self, message):
        self.terminal.write(message)
        self.log.write(message)

# Function to create CSV from a SQL query.
def sqlExport():
    sys.stdout = Logger() # Start screen capture to log file

    con = sqlite3.connect(sys.argv[1]) # input database name (e.g. database.db) and creates in current working directory.
    cur = con.cursor()

    try:
        # IFNULL keeps groups with no matching VNE from being printed as "None" below.
        cur.execute("SELECT network, "
                    "IFNULL(SUM(CASE WHEN VNE = 'V1' THEN 1 END), 0) AS VNECH1, "
                    "IFNULL(SUM(CASE WHEN VNE = 'V2' THEN 1 END), 0) AS VNECH2, "
                    "IFNULL(SUM(CASE WHEN VNE = 'V3' THEN 1 END), 0) AS VNECH3 "
                    "FROM data_table GROUP BY network ORDER BY network;")
        data = cur.fetchall()

        for row in data:
            print '"'+row[0]+'","'+str(row[1])+'","'+str(row[2])+'","'+str(row[3])+'"'

    except (KeyboardInterrupt, SystemExit):
        raise

    con.close()
    sys.stdout = sys.__stdout__ # stops capturing data from database export.

# Primary function to execute
def main():
    sqlExport()

if __name__=='__main__':
    main()
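
For reference, the same CASE/GROUP BY idea expressed against the original "IP"/"VNE" values might look like the sketch below. The staging table csv_import (holding the raw CSV rows) is an assumption for illustration only; the posted script queries a different table (data_table) with different values.

#!/usr/bin/python

import sys, sqlite3

con = sqlite3.connect(sys.argv[1]) # database file

cur = con.cursor()

# csv_import(IP TEXT, VNE TEXT) is an assumed staging table loaded from the CSV.
# MAX(CASE ...) yields 'x' when any row in the group matches, otherwise NULL.
cur.execute("SELECT IP, "
            "MAX(CASE WHEN VNE = 'internal' THEN 'x' END) AS VNE_INTERNAL, "
            "MAX(CASE WHEN VNE = 'enterprise' THEN 'x' END) AS VNE_ENTERPRISE, "
            "MAX(CASE WHEN VNE = 'external' THEN 'x' END) AS VNE_EXTERNAL "
            "FROM csv_import GROUP BY IP ORDER BY IP;")

for row in cur.fetchall():
    print row

con.close()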

Upvotes: 0

Views: 1219

Answers (1)

George

Reputation: 1054

  1. Make sure there is an index on the IP field.
  2. Add all the rows from the CSV into a set first, so as to remove the duplication and hopefully save a couple of million operations.
  3. Remove the logging function. You can assume that an update has completed because otherwise an exception would have been raised, so the log doesn't tell you anything.
  4. Experiment with reducing the frequency of commits - though doing all the updates in a single transaction may have its own problems. (Points 1, 2 and 4 are sketched in the code after this list.)
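
A minimal sketch of how points 1, 2 and 4 might be applied to the script in the question (the index name, the header skip and the VNE-to-column mapping are assumptions, not code from the question):

#!/usr/bin/python

import sys, csv, sqlite3

con = sqlite3.connect(sys.argv[1]) # database file
cur = con.cursor()

# 1. Index the lookup column so each UPDATE is an index seek rather than a full table scan.
cur.execute('CREATE INDEX IF NOT EXISTS idx_dist_trpwrenc_ip ON dist_trpwrenc (IP);')

# 2. Load the (IP, VNE) pairs into a set first to drop the roughly two million duplicates.
pairs = set()
with open(sys.argv[2], "rb") as f: # CSV file
    reader = csv.reader(f, delimiter=',')
    next(reader) # skip the "IP","VNE" header row
    for row in reader:
        pairs.add((row[0], row[1]))

# Map each VNE value to the column it flags; unknown values are simply skipped.
columns = {'internal': 'VNE_INTERNAL',
           'enterprise': 'VNE_ENTERPRISE',
           'external': 'VNE_EXTERNAL'}

# 4. Do all the updates in one transaction instead of committing after every row.
for ip, vne in pairs:
    col = columns.get(vne)
    if col is not None:
        # col comes from the whitelist above, so interpolating it is safe here.
        cur.execute("UPDATE dist_trpwrenc SET %s='x' WHERE IP=?;" % col, (ip,))

con.commit()
con.close()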

If all that still isn't enough, and it fits your application (for example, if the CSV contains all the rows for the dist_trpwrenc table), then deleting the existing records or the whole table and repopulating it from the CSV with INSERT queries may be quicker than a lot of UPDATEs.
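
A rough sketch of that rebuild approach, assuming the CSV alone can reproduce every IP that belongs in dist_trpwrenc (the flags dict and the header skip are assumptions; the column layout is taken from the question's schema):

#!/usr/bin/python

import sys, csv, sqlite3

con = sqlite3.connect(sys.argv[1]) # database file
cur = con.cursor()

# Group the CSV by IP, collecting which VNE flags each address should receive.
flags = {} # IP -> set of VNE values
with open(sys.argv[2], "rb") as f: # CSV file
    reader = csv.reader(f, delimiter=',')
    next(reader) # skip the "IP","VNE" header row
    for ip, vne in reader:
        flags.setdefault(ip, set()).add(vne)

# Wipe and repopulate the table in a single transaction.
cur.execute('DELETE FROM dist_trpwrenc;')
rows = [(ip,
         'x' if 'internal' in vnes else None,
         'x' if 'enterprise' in vnes else None,
         'x' if 'external' in vnes else None)
        for ip, vnes in flags.iteritems()]
cur.executemany('INSERT INTO dist_trpwrenc (IP, VNE_INTERNAL, VNE_ENTERPRISE, VNE_EXTERNAL) '
                'VALUES (?, ?, ?, ?);', rows)

con.commit()
con.close()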

Upvotes: 1
