Reputation: 9705
I'm running the following:
.mode tabs
CREATE TABLE mytable(mytextkey TEXT PRIMARY KEY, field1 INTEGER, field2 REAL);
.import mytable.tsv mytable
mytable.tsv
is approx. 6 GB and 50 million rows. The process takes an extremely long time (hours) to run and it also completely throttles the performance of the entire system, I'm guessing because of temporary disk IO.
I don't understand why it takes so long and why it thrashes the disk so much, when I have plenty of free physical RAM it could use for temporary write.
How do I improve this process?
PS: Yes I did search for an previous question and answer, but nothing I found helped.
Upvotes: 2
Views: 2048
Reputation: 1572
For me the following two Pragmas made a significant improvement:
PRAGMA synchronous = 0;
PRAGMA locking_mode = EXCLUSIVE;
Importing 20k SQL-Statements went from 70 seconds to 1,5 seconds
Upvotes: 0
Reputation: 9705
The following settings helped speed things up tremendously.
PRAGMA journal_mode = OFF
PRAGMA cache_size = 7500000
PRAGMA synchronous = 0
PRAGMA temp_store = 2
Upvotes: 3
Reputation: 52579
In Sqlite, a normal rowid table uses a 64-bit integer primary key. If you have a PK in the table definition that's anything but a single INTEGER
column, that is instead treated as a unique index, and each row inserted has to update both the original table and that index, doubling the work (And in your case effectively doubling the storage requirements). If you instead make your table a WITHOUT ROWID
one, the PK is a true PK and doesn't require an extra index table. That change alone should roughly halve both the time it takes to import your dataset and the size of the database. (If you have other indexes on the table, or use that PK as a foreign key in another table, it might not be worth making the change in the long run as it'll increase the amount of space needed for those tables by potentially a lot given the lengths of your keys; in that case, see Schwern's answer).
Sorting the input on the key column first can help too on large imports because there's less random access of b-tree pages and moving of data within those pages. Everything goes into the same page until it fills up and a new one is allocated and any needed rebalancing is done.
You can also turn on some unsafe settings that in normal usage aren't recommended because they can result in data loss or outright corruption, but if that happens during import because of a freak power outage or whatever, you can always just start over. In particular, setting the synchronous mode and journal type to OFF
. That results in fewer disc writes over the course of the import.
Upvotes: 4
Reputation: 165536
My assumption is the problem is the text primary key. This requires building a large and expensive text index.
The primary key is a long nucleotide sequence (anywhere from 20 to 300 characters), field1 is a integer (between 1 and 1500) and field2 is a relative log ratio (between -10 and +10 roughly).
Text primary keys have few advantages and many drawbacks.
Consider what happens when you make a new table which references this one.
create table othertable(
myrefrence references mytable, -- this is text
something integer,
otherthing integer
)
othertable
now must store a copy of the entire sequence, bloating the table. Instead of being simple integers it now has a text column, bloating the table. And it must make its own text index, bloating the index, and slowing down joins and inserts.
Instead, use a normal, integer, autoincrementing primary key and make the sequence column unique (which is also indexed). This provides all the benefits of a text primary key with none of the drawbacks.
create table sequences(
id integer primary key autoincrement,
sequence text not null unique,
field1 integer not null,
field2 real not null
);
Now references to sequences
are a simple integer.
Because the SQLite import process is not very customizable, getting your data into this table in SQLite efficiently requires a couple steps.
First, import your data into a table which does not yet exist. Make sure it has header fields matching your desired column names.
$ cat test.tsv
sequence field1 field2
d34db33f 1 1.1
f00bar 5 5.5
somethings 9 9.9
sqlite> .import test.tsv import_sequences
As there's no indexing happening, this process should go pretty quick. SQLite made a table called import_sequences
with everything of type text
.
sqlite> .schema import_sequences
CREATE TABLE import_sequences(
"sequence" TEXT,
"field1" TEXT,
"field2" TEXT
);
sqlite> select * from import_sequences;
sequence field1 field2
---------- ---------- ----------
d34db33f 1 1.1
f00bar 5 5.5
somethings 9 9.9
Now we create the final production table.
sqlite> create table sequences(
...> id integer primary key autoincrement,
...> sequence text not null unique,
...> field1 integer not null,
...> field2 real not null
...> );
For efficiency, normally you'd add the unique constraint after the import, but SQLite has very limited ability to alter a table and cannot alter an existing column except to change its name.
Now transfer the data from the import table into sequences
. The primary key will be automatically populated.
insert into sequences (sequence, field1, field2)
select sequence, field1, field2
from import_sequences;
Because the sequence
must be indexed this might not import any faster, but it will result in a much better and more efficient schema going forward. If you want efficiency consider a more robust database.
Once you've confirmed the data came over correctly, drop the import table.
Upvotes: 2