themantalope
themantalope

Reputation: 1040

peewee with bulk insert is very slow into sqlite db

I'm trying to do a large scale bulk insert into a sqlite database with peewee. I'm using atomic but the performance is still terrible. I'm inserting the rows in blocks of ~ 2500 rows, and due to the SQL_MAX_VARIABLE_NUMBER I'm inserting about 200 of them at a time. Here is the code:

with helper.db.atomic():
   for i in range(0,len(expression_samples),step):
      gtd.GeneExpressionRead.insert_many(expression_samples[i:i+step]).execute()

And the list expression_samples is a list of dictionaries with the appropriate fields for the GeneExpressionRead model. I've timed this loop, and it takes anywhere from 2-8 seconds to execute. I have millions of rows to insert, and the way I have my code written now it will likely take 2 days to complete. As per this post, there are several pragmas that I have set in order to improve performance. This also didn't really change anything for me performance wise. Lastly, as per this test on the peewee github page it should be possible to insert many rows very fast (~50,000 in 0.3364 seconds) but it also seems that the author used raw sql code to get this performance. Has anyone been able to do such a high performance insert using peewee methods?

Edit: Did not realize that the test on peewee's github page was for MySQL inserts. May or may not apply to this situation.

Upvotes: 3

Views: 2233

Answers (2)

coleifer
coleifer

Reputation: 26245

Mobius was trying to be helpful in the comments but there's a lot of misinformation in there.

  • Peewee creates indexes for foreign keys when you create the table. This happens for all database engines currently supported.
  • Turning on the foreign key PRAGMA is going to slow things down, why would it be otherwise?
  • For best performance, do not create any indexes on the table you are bulk-loading into. Load the data, then create the indexes. This is much much less work for the database.
  • As you noted, disabling auto increment for the bulk-load speeds things up.

Other information:

  • Use PRAGMA journal_mode=wal;
  • Use PRAGMA synchronous=0;
  • Use PRAGMA locking_mode=EXCLUSIVE;

Those are some good settings for loading in a bunch of data. Check the sqlite docs for more info:

http://sqlite.org/pragma.html

Upvotes: 5

mobiusklein
mobiusklein

Reputation: 1423

In all of the documentation where code using atomic appears as a context manager, it's been used as a function. Since it sounds like you're never seeing your code exit the with block, you're probably not seeing an error about not having an __exit__ method.

Can you try with helper.db.atomic():?

atomic() is starting a transaction. Without an open transaction, inserts are much slower because some expensive book keeping has to be done for every write, as opposed to only at the beginning and end.

EDIT

Since the code to start the question was changed, can I have some more information about the table you're inserting into? Is it large, how many indices are there?

Since this is SQLite, you're just writing to a file, but do you know if that file is on a local disk or on a network-mounted drive? I've had issues just like this because I was trying to insert into a database on an NFS.

Upvotes: -1

Related Questions