Reputation: 4029
I am using Python 2.7 and SQLite. I am building a database with millions of rows. I would like to write out to disk only occasionally, with the idea that this will improve performance. My thought was to call commit() only from time to time. I have tried that with the code below. The selects in the middle show that we get consistent reads. But when I look on disk, I see a file example.db-journal. This must be where the data is being cached, in which case it would gain me nothing in terms of performance. Is there a way to have the inserts collect in memory and then flush them to disk? Is there a better way to do this?
Here is my sample code:
import sqlite3
conn = sqlite3.connect('example.db')
c = conn.cursor()
c.execute('CREATE TABLE if not exists stocks (date text, trans text, symbol text, qty real, price real)')
c.execute("INSERT INTO stocks VALUES ('2006-01-05','BUY','RHAT',100,35.14)")
c.execute("INSERT INTO stocks VALUES ('2006-01-05','BUY','RHAT',100,35.14)")
t = ('RHAT',)
c.execute('SELECT date, symbol, trans FROM stocks WHERE symbol=?', t)
# Here, we get 2 rows as expected.
print c.fetchall()
c.execute("INSERT INTO stocks VALUES ('2006-01-05','BUY','RHAT',100,35.14)")
c.execute("INSERT INTO stocks VALUES ('2006-01-05','BUY','RHAT',100,35.14)")
c.execute("INSERT INTO stocks VALUES ('2006-01-05','BUY','RHAT',100,35.14)")
c.execute("INSERT INTO stocks VALUES ('2006-01-05','BUY','RHAT',100,35.14)")
c.execute("INSERT INTO stocks VALUES ('2006-01-05','BUY','RHAT',100,35.14)")
conn.commit()
t = ('RHAT',)
c.execute('SELECT date, symbol, trans FROM stocks WHERE symbol=?', t)
# Here, we get all the rows as expected.
print c.fetchall()
conn.close()
Update:
Figured I would give an update with some code in case anyone runs across this problem. I am processing 5+ million lines from a text file and needed a place to store the data for further processing. I originally had all the data in memory but was running out of memory. So I switched to SQLite as a disk cache. My original in-memory version of the processing took ~36 seconds per 50,000 rows from the original text file.
After measuring, my first cut at a SQLite version of the batch processing took ~660 seconds for 50,000 lines. Based on the comments (thanks to the posters), I came up with the following code:
self.conn = sqlite3.connect('myDB.db', isolation_level='Exclusive')
self.cursor = self.conn.cursor()
self.cursor.execute('PRAGMA synchronous = 0')    # hand writes to the OS without waiting for fsync
self.cursor.execute('PRAGMA journal_mode = OFF') # no rollback journal (data is at risk if the process crashes)
In addition, I commit after processing 1000 lines from my text file.
if lineNum % 1000 == 0:
    self.conn.commit()
With that, processing 50,000 lines from the text file now takes ~40 seconds. So I added about 11% to the overall time compared with the in-memory version, but memory usage stays constant, which is more important.
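For anyone who wants the whole pattern in one place, here is a minimal sketch of the approach; the file name, table layout, and batch size are placeholders rather than my exact processing code:
import sqlite3

conn = sqlite3.connect('myDB.db', isolation_level='EXCLUSIVE')
cursor = conn.cursor()
cursor.execute('PRAGMA synchronous = 0')    # skip fsync; faster, but risk of losing data on a crash
cursor.execute('PRAGMA journal_mode = OFF') # no rollback journal at all
cursor.execute('CREATE TABLE IF NOT EXISTS lines (lineNum integer, content text)')

with open('input.txt') as f:
    for lineNum, line in enumerate(f, 1):
        cursor.execute('INSERT INTO lines VALUES (?, ?)', (lineNum, line.rstrip('\n')))
        if lineNum % 1000 == 0:   # flush a batch to disk
            conn.commit()

conn.commit()   # commit the final partial batch
conn.close()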
Upvotes: 4
Views: 10216
Reputation: 5766
You only get a db connection in autocommit mode if you open it with isolation_level=None; in that mode, every time you execute an INSERT statement the database starts a transaction, executes the statement and commits, and an explicit commit() is meaningless. With the default settings, however, the sqlite3 module opens a transaction implicitly before your first INSERT and keeps it open until you call commit(). See the sqlite3 module documentation in the Python docs.
Either way, you are correct that inserting a large quantity of rows should ideally be done within a single transaction. Inside a transaction, SQLite records rollback information in the journal file and only has to sync the database file once, when the transaction is committed, so the per-row overhead drops sharply. Even though your run is limited by I/O, writing to the journal file is not a serious performance penalty.
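As a rough sketch of the difference, here is how you would batch the inserts into one explicit transaction when the connection really is in autocommit mode (this reuses the stocks table from your question and an arbitrary row count):
import sqlite3

conn = sqlite3.connect('example.db', isolation_level=None)  # real autocommit mode
c = conn.cursor()

# In autocommit mode every INSERT would be its own transaction, synced to disk
# one row at a time. Wrapping the batch in an explicit transaction means the
# database file is only synced once, at COMMIT.
c.execute('BEGIN')
for _ in range(1000):
    c.execute("INSERT INTO stocks VALUES ('2006-01-05','BUY','RHAT',100,35.14)")
c.execute('COMMIT')
conn.close()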
Upvotes: 1
Reputation: 4392
Firstly, are you sure you need this? For reading, the OS should cache the file anyway, and if you write a lot, not syncing to disk means you can lose data easily.
If you measure and identify this as a bottleneck, you can use an in-memory database via connect(':memory:')
and, on demand, get an iterator over an SQL dump of it: http://docs.python.org/2/library/sqlite3.html#sqlite3.Connection.iterdump
import sqlite3

in_memory = sqlite3.connect(':memory:')
# do stuff: create tables and insert rows into the in-memory database

con = sqlite3.connect('existing_db.db')
con.execute('drop table stocks')      # make room for the dumped copy
for line in in_memory.iterdump():     # replay the dump statement by statement
    con.execute(line)
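If you prefer, the same replay can be done in one call; executescript() takes the whole dump as a single string (dump_sql is just a throwaway name here):
dump_sql = "\n".join(in_memory.iterdump())
con.executescript(dump_sql)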
Again, measure whether you need this. If you have enough data that it matters, think hard about using a different data store, for example a full-blown DBMS like PostgreSQL.
Upvotes: 4