thc

Reputation: 9705

SQLite is very slow when performing .import on a large table

I'm running the following:

  .mode tabs
  CREATE TABLE mytable(mytextkey TEXT PRIMARY KEY, field1 INTEGER, field2 REAL);
  .import mytable.tsv mytable

mytable.tsv is approx. 6 GB and 50 million rows. The process takes an extremely long time (hours) to run, and it also completely throttles the performance of the entire system, I'm guessing because of temporary disk I/O.

I don't understand why it takes so long and why it thrashes the disk so much when I have plenty of free physical RAM it could use for temporary writes.

How do I improve this process?

PS: Yes, I did search for previous questions and answers, but nothing I found helped.

Upvotes: 2

Views: 2048

Answers (4)

Birkenstab

Reputation: 1572

For me the following two Pragmas made a significant improvement:

PRAGMA synchronous = 0;
PRAGMA locking_mode = EXCLUSIVE;

Importing 20k SQL statements went from 70 seconds to 1.5 seconds.
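
A sketch of how these might be applied in practice, assuming the statements are replayed from a file in the sqlite3 shell (statements.sql is an illustrative name):

PRAGMA synchronous = 0;
PRAGMA locking_mode = EXCLUSIVE;
.read statements.sql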

Upvotes: 0

thc

Reputation: 9705

The following settings helped speed things up tremendously.

PRAGMA journal_mode = OFF;
PRAGMA cache_size = 7500000;
PRAGMA synchronous = 0;
PRAGMA temp_store = 2;
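
These settings are per-connection (temp_store = 2 keeps temporary tables and indexes in memory, and a positive cache_size is a page count, not bytes), so they need to be issued in the same sqlite3 session that runs the import, before the commands from the question:

.mode tabs
.import mytable.tsv mytable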

Upvotes: 3

Shawn

Reputation: 52579

In SQLite, a normal rowid table uses a 64-bit integer primary key. If the PK in the table definition is anything but a single INTEGER column, it is instead treated as a unique index, and each inserted row has to update both the original table and that index, doubling the work (and, in your case, effectively doubling the storage requirements). If you instead make your table a WITHOUT ROWID one, the PK is a true PK and doesn't require a separate index. That change alone should roughly halve both the time it takes to import your dataset and the size of the database. (If you have other indexes on the table, or use that PK as a foreign key in another table, the change might not be worth it in the long run, since it will increase the space needed for those tables, potentially by a lot given the lengths of your keys; in that case, see Schwern's answer.)
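
For reference, here is the question's table as a WITHOUT ROWID one; the DDL is unchanged apart from the added clause:

CREATE TABLE mytable(
  mytextkey TEXT PRIMARY KEY,
  field1 INTEGER,
  field2 REAL
) WITHOUT ROWID;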

Sorting the input on the key column first can also help on large imports, because there's less random access of b-tree pages and less moving of data within those pages: everything goes into the same page until it fills up, at which point a new one is allocated and any needed rebalancing is done.
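
For example, assuming a Unix-like system and that the key is the first tab-separated column, the file could be pre-sorted with something like (the output name is illustrative):

$ sort -t $'\t' -k 1,1 mytable.tsv > mytable.sorted.tsv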

You can also turn on some unsafe settings that aren't recommended in normal usage because they can result in data loss or outright corruption; but if that happens during the import because of a freak power outage or whatever, you can always just start over. In particular, set the synchronous mode and the journal type to OFF, which results in fewer disk writes over the course of the import.
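
In PRAGMA form, these are:

PRAGMA synchronous = OFF;
PRAGMA journal_mode = OFF;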

Upvotes: 4

Schwern

Reputation: 165536

My assumption is that the problem is the text primary key, which requires building a large and expensive text index.

As the question's author describes the data:

  "The primary key is a long nucleotide sequence (anywhere from 20 to 300 characters), field1 is an integer (between 1 and 1500) and field2 is a relative log ratio (between -10 and +10 roughly)."

Text primary keys have few advantages and many drawbacks.

  • They require large, slow indexes. Slow to build, slow to query, slow to insert.
  • Text is tempting to change, which is exactly what you don't want a primary key to do.
  • Any table referencing it must also store and index the text, adding to the bloat.
  • Joins with this table will be slower due to the text primary key.

Consider what happens when you make a new table which references this one.

create table othertable(
  myreference references mytable, -- this is text
  something integer,
  otherthing integer
);

othertable must now store a copy of the entire sequence: instead of a simple integer column it has a text column, bloating the table. And it must build its own text index, bloating the index and slowing down joins and inserts.

Instead, use a normal autoincrementing integer primary key and make the sequence column unique (which also creates an index on it). This provides all the benefits of a text primary key with none of the drawbacks.

create table sequences(
  id integer primary key autoincrement,
  sequence text not null unique,
  field1 integer not null,
  field2 real not null
);

Now references to sequences are a simple integer.
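
For illustration, here is the earlier othertable example reworked to use the integer key (the columns are as hypothetical as before):

create table othertable(
  sequence_id integer references sequences(id), -- a simple integer now
  something integer,
  otherthing integer
);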


Because the SQLite import process is not very customizable, getting your data into this table efficiently requires a couple of steps.

First, import your data into a table which does not yet exist. Make sure it has header fields matching your desired column names.

$ cat test.tsv
sequence    field1  field2
d34db33f    1   1.1
f00bar  5   5.5
somethings  9   9.9

sqlite> .import test.tsv import_sequences

As there's no indexing happening, this process should go pretty quickly. SQLite creates a table called import_sequences with every column typed as text.

sqlite> .schema import_sequences
CREATE TABLE import_sequences(
  "sequence" TEXT,
  "field1" TEXT,
  "field2" TEXT
);

sqlite> select * from import_sequences;
sequence    field1      field2    
----------  ----------  ----------
d34db33f    1           1.1       
f00bar      5           5.5       
somethings  9           9.9       

Now we create the final production table.

sqlite> create table sequences(
   ...>   id integer primary key autoincrement,
   ...>   sequence text not null unique,
   ...>   field1 integer not null,
   ...>   field2 real not null
   ...> );

Normally, for efficiency, you'd add the unique constraint after the import, but SQLite's ability to alter a table is very limited: it cannot change an existing column except to rename it.
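
One possible variation, if the indexed insert below turns out to be the bottleneck: declare the sequence column without unique, do the bulk insert, and then enforce uniqueness with a separate unique index afterwards, which in SQLite is functionally equivalent to the column constraint (the index name is illustrative):

create unique index idx_sequences_sequence on sequences(sequence);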

Now transfer the data from the import table into sequences. The primary key will be automatically populated.

insert into sequences (sequence, field1, field2)
select sequence, field1, field2
from import_sequences;

Because the sequence must be indexed, this might not import any faster, but it will result in a much better and more efficient schema going forward. If you need more performance, consider a more robust database.

Once you've confirmed the data came over correctly, drop the import table.
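
That cleanup is a single statement:

drop table import_sequences;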

Upvotes: 2
