Reputation: 125
Due to the limitations of SQLite 3.7.13 when updating columns on a JOIN, I created the following Python script to help with the problem. However, because of the amount of data I am working with, I am running into system resource issues and the update method is taking too long.
I have a SQLite3 table that contains 7,159,587 records with the following schema:
dist_trpwrenc (id integer primary key autoincrement, IP TEXT, VNE_INTERNAL TEXT, VNE_ENTERPRISE TEXT, VNE_EXTERNAL TEXT)
I have a CSV file that contains 9,224,812 records and includes duplicates. Here is an example of the data in the CSV file:
"IP","VNE"
"192.168.1.1","internal"
"192.168.1.1","enterprise"
"192.168.1.1","external"
"192.168.2.1","internal"
"192.168.2.1","external"
The Python script takes the CSV file and updates the "dist_trpwrenc" table as in the example below:
---------------------------------------------------------------
| IP          | VNE_INTERNAL | VNE_ENTERPRISE | VNE_EXTERNAL  |
| 192.168.1.1 | x            | x              | x             |
| 192.168.2.1 | x            |                | x             |
---------------------------------------------------------------
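In other words, for each ("192.168.1.1", "internal") pair in the CSV, the script effectively runs an UPDATE like the one in this sketch (using the sqlite3 module; the literal IP is just the sample value from above):

import sqlite3

con = sqlite3.connect('database.db')  # sample database name; the script takes it from sys.argv[1]
cur = con.cursor()
# Flag the matching column for every row with this IP.
cur.execute('UPDATE dist_trpwrenc SET VNE_INTERNAL="x" WHERE IP=?;', ('192.168.1.1',))
con.commit()
con.close()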
I am looking for a faster way to handle the updates. Is that possible with SQLite3/Python?
#!/usr/bin/python
from openpyxl.reader.excel import load_workbook
import sys, csv, sqlite3, logging, time, os, errno

s = time.strftime('%Y%m%d%H%M%S')

# Create exception file from standard output
class Logger(object):
    def __init__(self):
        self.terminal = sys.stdout
        self.log = open(s + "_log", "a")

    def write(self, message):
        self.terminal.write(message)
        self.log.write(message)

def dist_trpwrenc_update():
    sys.stdout = Logger()
    con = sqlite3.connect(sys.argv[1])  # database name (e.g. database.db); opened/created in the current working directory
    cur = con.cursor()
    try:
        with open(sys.argv[2], "rb") as f:  # input CSV file
            reader = csv.reader(f, delimiter=',')
            for row in reader:
                try:
                    ipupdate = (row[4],)
                    if row[3] == 'internal':
                        cur.execute('UPDATE dist_trpwrenc SET VNE_INTERNAL="x" WHERE IP=?;', ipupdate)
                        con.commit()
                        print row[0], row[4], 'updated:', row[3], 'successfully!'
                    elif row[3] == 'enterprise':
                        cur.execute('UPDATE dist_trpwrenc SET VNE_ENTERPRISE="x" WHERE IP=?;', ipupdate)
                        con.commit()
                        print row[0], row[4], 'updated:', row[3], 'successfully!'
                    elif row[3] == 'external':
                        cur.execute('UPDATE dist_trpwrenc SET VNE_EXTERNAL="x" WHERE IP=?;', ipupdate)
                        con.commit()
                        print row[0], row[4], 'updated:', row[3], 'successfully!'
                    else:
                        print row[0], row[4], 'did not update:', row[3], 'successfully.'
                except (KeyboardInterrupt, SystemExit):
                    raise
                except IOError:
                    raise
    finally:
        # Close SQLite database connection
        con.close()
        # Stop logging
        sys.stdout = sys.__stdout__

def main():
    dist_trpwrenc_update()

if __name__ == '__main__':
    main()
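For comparison, here is a minimal sketch of a batched variant of the same loop: collect the unique IPs per VNE value, add an index on IP so each UPDATE does not scan all 7 million rows, and issue one executemany() per column inside a single transaction instead of committing per row. The column positions (row[3], row[4]) are assumed to match the script above.

import sys, csv, sqlite3

con = sqlite3.connect(sys.argv[1])
cur = con.cursor()

# An index on IP keeps each UPDATE from scanning the whole table.
cur.execute('CREATE INDEX IF NOT EXISTS idx_dist_trpwrenc_ip ON dist_trpwrenc (IP);')

# Map each VNE value to the column it should flag (same mapping as the loop above).
columns = {'internal': 'VNE_INTERNAL',
           'enterprise': 'VNE_ENTERPRISE',
           'external': 'VNE_EXTERNAL'}
ips = {'internal': set(), 'enterprise': set(), 'external': set()}

with open(sys.argv[2], 'rb') as f:
    for row in csv.reader(f, delimiter=','):
        if row[3] in ips:  # column positions assumed from the script above
            ips[row[3]].add(row[4])

# One executemany() per column, all committed in a single transaction.
for vne, column in columns.items():
    cur.executemany('UPDATE dist_trpwrenc SET %s="x" WHERE IP=?;' % column,
                    [(ip,) for ip in ips[vne]])
con.commit()
con.close()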
Thanks for all the tips. I went with a different approach instead, using a SQL CASE statement:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import sys, csv, sqlite3, logging, time, os, errno

# Functions
s = time.strftime('%Y%m%d%H%M%S')

# Create file from standard output for database import
class Logger(object):
    def __init__(self):
        self.terminal = sys.stdout
        self.log = open(s + "_" + sys.argv[1], "a")

    def write(self, message):
        self.terminal.write(message)
        self.log.write(message)

# Function to create a CSV from a SQL query.
def sqlExport():
    sys.stdout = Logger()  # start capturing screen output to the log file
    con = sqlite3.connect(sys.argv[1])  # database name (e.g. database.db); opened/created in the current working directory
    cur = con.cursor()
    try:
        cur.execute('SELECT network, '
                    'SUM(case when VNE = "V1" then 1 end) as VNECH1, '
                    'SUM(case when VNE = "V2" then 1 end) as VNECH2, '
                    'SUM(case when VNE = "V3" then 1 end) as VNECH3 '
                    'FROM data_table GROUP BY network ORDER BY network;')
        data = cur.fetchall()
        for row in data:
            print '"' + row[0] + '","' + str(row[1]) + '","' + str(row[2]) + '","' + str(row[3]) + '"'
    except (KeyboardInterrupt, SystemExit):
        raise
    con.close()
    sys.stdout = sys.__stdout__  # stop capturing the database export output

# Primary function to execute
def main():
    sqlExport()

if __name__ == '__main__':
    main()
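Applied back to the original dist_trpwrenc data, the same CASE pivot would look roughly like this sketch; the staging table csv_import and its ip/vne columns are hypothetical stand-ins for the raw CSV rows, since the posted query uses the generic names data_table, network and V1/V2/V3:

import sqlite3

con = sqlite3.connect('database.db')  # sample name; the script above takes it from sys.argv[1]
cur = con.cursor()
# csv_import (ip, vne) is a hypothetical staging table holding the raw CSV rows;
# each vne value is pivoted into its own flag column, one output row per IP.
cur.execute("SELECT ip, "
            "MAX(case when vne = 'internal' then 'x' end) as VNE_INTERNAL, "
            "MAX(case when vne = 'enterprise' then 'x' end) as VNE_ENTERPRISE, "
            "MAX(case when vne = 'external' then 'x' end) as VNE_EXTERNAL "
            "FROM csv_import GROUP BY ip ORDER BY ip;")
for row in cur.fetchall():
    print row
con.close()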
Upvotes: 0
Views: 1219
Reputation: 1054
If all that still isn't enough, and it fits your application (for example, the CSV contains all the rows that belong in the dist_trpwrenc table), deleting the existing records (or the whole table) and repopulating it from the CSV with INSERT queries may be quicker than a large number of UPDATEs.
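A minimal sketch of that repopulation approach, assuming the CSV covers every IP that belongs in the table and using the same column positions (row[3], row[4]) as the question's script:

import sys, csv, sqlite3

con = sqlite3.connect(sys.argv[1])
cur = con.cursor()

# Pivot the CSV in memory: one entry per IP, flagging each VNE value seen.
rows = {}
with open(sys.argv[2], 'rb') as f:
    for r in csv.reader(f, delimiter=','):
        flags = rows.setdefault(r[4], {'internal': '', 'enterprise': '', 'external': ''})
        if r[3] in flags:
            flags[r[3]] = 'x'

# Drop the old rows and bulk-insert the rebuilt ones in a single transaction.
cur.execute('DELETE FROM dist_trpwrenc;')
cur.executemany('INSERT INTO dist_trpwrenc (IP, VNE_INTERNAL, VNE_ENTERPRISE, VNE_EXTERNAL) '
                'VALUES (?, ?, ?, ?);',
                [(ip, f['internal'], f['enterprise'], f['external'])
                 for ip, f in rows.items()])
con.commit()
con.close()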
Upvotes: 1