Doo Dah

Reputation: 4029

Python SQLite cache in memory

I am using Python 2.7 and SQLite, and I am building a database with millions of rows. I would like to write out to disk only occasionally, with the idea that this will improve performance. My thought was to call commit() only from time to time, and I have tried that with the code below. The selects in the middle show that we get consistent reads. But when I look on disc, I see a file example.db-journal, which must be where the data is being cached. If that is the case, this gains me nothing in terms of performance. Is there a way to have the inserts collect in memory and then flush them to disc? Is there a better way to do this?

Here is my sample code:

import sqlite3
conn = sqlite3.connect('example.db')
c = conn.cursor()
c.execute('CREATE TABLE if not exists stocks (date text, trans text, symbol text, qty real, price real)')

c.execute("INSERT INTO stocks VALUES ('2006-01-05','BUY','RHAT',100,35.14)")
c.execute("INSERT INTO stocks VALUES ('2006-01-05','BUY','RHAT',100,35.14)")
t = ('RHAT',)
c.execute('SELECT date, symbol, trans FROM stocks WHERE symbol=?', t)
# Here, we get 2 rows as expected.
print c.fetchall()
c.execute("INSERT INTO stocks VALUES ('2006-01-05','BUY','RHAT',100,35.14)")
c.execute("INSERT INTO stocks VALUES ('2006-01-05','BUY','RHAT',100,35.14)")
c.execute("INSERT INTO stocks VALUES ('2006-01-05','BUY','RHAT',100,35.14)")
c.execute("INSERT INTO stocks VALUES ('2006-01-05','BUY','RHAT',100,35.14)")
c.execute("INSERT INTO stocks VALUES ('2006-01-05','BUY','RHAT',100,35.14)")

conn.commit()

t = ('RHAT',)
c.execute('SELECT date, symbol, trans FROM stocks WHERE symbol=?', t)
# Here, we get all the rows as expected.
print c.fetchall()

conn.close()

Update:

Figured I would give an update with some code in case anyone runs across this problem. I am processing 5+ million lines from a text file and needed a place to store the data for further processing. I originally had all the data in memory but was running out of memory, so I switched to SQLite as a disc cache. My original in-memory version of the processing took ~36 seconds per 50,000 rows from the original text file.

After measuring, my first cut of the SQLite version of the batch processing took ~660 seconds for 50,000 lines. Based on the comments (thanks to the posters), I came up with the following code:

self.conn = sqlite3.connect('myDB.db', isolation_level='EXCLUSIVE')
self.cursor = self.conn.cursor()
self.cursor.execute('PRAGMA synchronous = 0')    # do not wait for the OS to flush each write
self.cursor.execute('PRAGMA journal_mode = OFF') # no rollback journal (unsafe on crash/power loss)

In addition, I commit after processing 1000 lines from my text file.

if lineNum % 1000 == 0:
    self.conn.commit()

With that, 50,000 lines from the text file now take ~40 seconds. So I added about 11% to the overall time compared with the in-memory version, but memory usage is constant, which is more important.
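For reference, a minimal, self-contained sketch of this pattern (the input file name, table schema, and column names here are placeholders, not the original code):

import sqlite3

BATCH_SIZE = 1000  # commit every 1000 lines, as above

conn = sqlite3.connect('myDB.db', isolation_level='EXCLUSIVE')
cursor = conn.cursor()
cursor.execute('PRAGMA synchronous = 0')     # do not wait for fsync on each write
cursor.execute('PRAGMA journal_mode = OFF')  # no rollback journal; unsafe if the process crashes
cursor.execute('CREATE TABLE IF NOT EXISTS lines (lineNum INTEGER, content TEXT)')

with open('input.txt') as f:  # hypothetical input file
    for lineNum, line in enumerate(f, 1):
        cursor.execute('INSERT INTO lines VALUES (?, ?)', (lineNum, line.strip()))
        if lineNum % BATCH_SIZE == 0:
            conn.commit()

conn.commit()  # flush the final partial batch
conn.close()

The PRAGMAs trade durability for speed: with synchronous = 0 and no journal, a crash mid-run can corrupt the database file, which is acceptable only because this is a throwaway cache.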

Upvotes: 4

Views: 10216

Answers (2)

MrGumble

Reputation: 5766

In your case, you are creating a database connection in autocommit mode, which means that every time you execute an INSERT statement the database starts a transaction, executes the statement, and commits. So your commit() is, in this case, meaningless. See the sqlite3 page in the Python docs.

But you are correct that inserting a large quantity of rows should ideally be done within a single transaction. That signals the connection to record all the incoming INSERT statements in the journal file, but to delay writing to the database file until the transaction is committed. Even though your execution is limited by I/O operations, writing to the journal file is no serious performance penalty.
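For illustration, a short sketch of batching many inserts into one transaction with executemany() and a single commit() at the end (the dummy row data is made up):

import sqlite3

conn = sqlite3.connect('example.db')
c = conn.cursor()
c.execute('CREATE TABLE IF NOT EXISTS stocks '
          '(date text, trans text, symbol text, qty real, price real)')

rows = [('2006-01-05', 'BUY', 'RHAT', 100, 35.14)] * 10000  # dummy data
# All rows are inserted as part of one transaction and only written out
# by the single commit() below.
c.executemany('INSERT INTO stocks VALUES (?,?,?,?,?)', rows)

conn.commit()
conn.close()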

Upvotes: 1

Thomas Fenzl

Reputation: 4392

Firstly, are you sure you need this? For reading, the OS should cache the file anyway, and if you write a lot, not syncing to disc means you can lose data easily.

If you measure and identify this as a bottleneck, you can use an in-memory database via connect(':memory:') and get an iterator returning an SQL dump on demand: http://docs.python.org/2/library/sqlite3.html#sqlite3.Connection.iterdump

import sqlite3

in_memory = sqlite3.connect(':memory:')
# do stuff

# isolation_level=None (autocommit) lets the BEGIN/COMMIT statements in the
# dump pass through without clashing with the module's implicit transactions
con = sqlite3.connect('existing_db.db', isolation_level=None)
con.execute('drop table if exists stocks')
for line in in_memory.iterdump():
    con.execute(line)

Again, measure whether you need this. If you have enough data that it matters, think hard about using a different data store, for example a full-blown DBMS like Postgres.
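If you want a rough way to check where the time goes, something like the sketch below (the helper name, file name, and dummy rows are made up) compares a file-backed database against an in-memory one for a bulk insert:

import sqlite3
import time

def time_bulk_insert(db_path, n=50000):
    # Illustrative timing helper: insert n dummy rows in one transaction
    # and return the elapsed wall-clock time in seconds.
    conn = sqlite3.connect(db_path)
    conn.execute('CREATE TABLE IF NOT EXISTS stocks '
                 '(date text, trans text, symbol text, qty real, price real)')
    rows = [('2006-01-05', 'BUY', 'RHAT', 100, 35.14)] * n
    start = time.time()
    conn.executemany('INSERT INTO stocks VALUES (?,?,?,?,?)', rows)
    conn.commit()
    conn.close()
    return time.time() - start

print(time_bulk_insert('on_disk.db'))  # file-backed database
print(time_bulk_insert(':memory:'))    # in-memory database, discarded on close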

Upvotes: 4
