jstm88

Reputation: 3355

Insert performance with millions of rows

I'm trying to use a Python script to parse the Wikipedia archives. (Yeah, I know.) Of course, the dump file is enormous.

This precludes loading the file into memory, and going into virtual memory isn't going to fare much better. So in order to work with the data, I decided to parse the necessary information into a SQLite database. For the XML parsing, I used the ElementTree library, which performs quite well. I confirmed that when running ONLY the XML parsing (just commenting out the database calls), it runs in linear time, with no slowdown as it traverses the file.
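Roughly, the parsing side looks like this (a simplified sketch, not my exact code; the tag handling is abbreviated):

    import xml.etree.ElementTree as ET

    def iter_titles(xml_path):
        """Stream article titles from the dump without building the whole tree."""
        for event, elem in ET.iterparse(xml_path, events=("end",)):
            if elem.tag.endswith("title") and elem.text:
                yield elem.text
            elif elem.tag.endswith("page"):
                elem.clear()  # drop the finished <page> subtree so memory stays flat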

The problem comes with trying to insert MILLIONS of rows into the SQLite database (one per Wikipedia article). The simple version of the table that I'm using for testing is as follows:

CREATE TABLE articles(
    id INTEGER NOT NULL PRIMARY KEY,
    title TEXT NOT NULL UNIQUE ON CONFLICT IGNORE);

So I just have the id and a text field during this initial phase. When I start adding rows via:

INSERT OR IGNORE INTO articles(title) VALUES(?1);

it performs well at first. But at around 8 million rows in, it begins to slow down dramatically, by an order of magnitude or more.

Some detail is of course needed. I'm using cur.executemany() with a single cursor created before the insert statements. Each call to executemany() gets a batch of about 100,000 rows, and I don't call db.commit() until ALL of the million+ rows have been inserted. From what I've read, executemany() shouldn't commit a transaction until db.commit() is called, as long as there are only INSERT statements.
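In code, the loop is essentially this (a simplified sketch reusing the iter_titles generator from above; the batch size constant and file name are placeholders):

    import sqlite3

    BATCH_SIZE = 100_000
    INSERT_SQL = "INSERT OR IGNORE INTO articles(title) VALUES(?1);"

    def load_titles(db, titles):
        """Feed titles to executemany() in ~100k-row batches; commit only once at the end."""
        cur = db.cursor()
        batch = []
        for title in titles:
            batch.append((title,))
            if len(batch) >= BATCH_SIZE:
                cur.executemany(INSERT_SQL, batch)
                batch = []
        if batch:
            cur.executemany(INSERT_SQL, batch)
        db.commit()  # single commit after everything is inserted

    db = sqlite3.connect("articles.db")
    load_titles(db, iter_titles("enwiki-pages-articles.xml"))  # placeholder file name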

The source XML being read and the database being written are on two separate disks, and I've also tried creating the database in memory, but I see the slowdown regardless. I also tried the isolation_level=None option, adding the BEGIN TRANSACTION and COMMIT TRANSACTION calls myself at the beginning and end (so the entire parse sequence is one transaction), but it still doesn't help.
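The explicit-transaction attempt looked roughly like this:

    import sqlite3

    db = sqlite3.connect("articles.db", isolation_level=None)  # autocommit mode; I manage the transaction
    cur = db.cursor()
    cur.execute("BEGIN TRANSACTION;")
    # ... all of the executemany() batches from above go here ...
    cur.execute("COMMIT TRANSACTION;")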

Some other questions on this site suggest that indexing is the problem. I don't have any indexes on the table. I did try removing the UNIQUE constraint and just limiting it to id INTEGER PRIMARY KEY and title TEXT NOT NULL but that also had no effect.

What's the best way to perform these types of insertions in SQLite for large data sets? Of course this simple query is just the first of many; there are other queries that will be more complex, involving foreign keys (ids of articles in this table) as well as insert statements with embedded selects (selecting an id from the articles table during an insert). These are bound to have the same problem, but exacerbated by a large margin: where the articles table has fewer than 15 million rows, the other tables are probably going to have over a billion. So these performance issues are even more concerning.
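For instance, the later inserts will look something like this (the links table, its columns, and the bound variables are hypothetical, just to show the shape of the query):

    # Hypothetical second table keyed on article ids; the embedded SELECT
    # resolves a title to its id in the same statement.
    cur.execute("""
        INSERT OR IGNORE INTO links(from_id, to_id)
        VALUES (?1, (SELECT id FROM articles WHERE title = ?2));
    """, (from_id, to_title))  # from_id / to_title come from the parser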

Upvotes: 4

Views: 3692

Answers (1)

dig

Reputation: 449

One "invisible" thing happening on insertion is updating a table's indices (and checking index-related constraints such as UNIQUE). Since you're ignoring UNIQUE violations anyway, you may find it useful to disable the indices on the table while you're loading the table, and if you really need them, build the indices once after the loading is complete.

But also beware that SQLite's lightning speed for small data comes from certain implicit assumptions that get increasingly violated when you're processing big data. It may not be an appropriate tool for your current problem on your current hardware.

Upvotes: 3
