user1902853

Reputation: 409

Why do inserts slow down so much as the DB grows?

I'm doing a personal project that generates a lot of data, and I thought that storing it in a local DB made sense. However, I'm seeing insane slowdown as the DB grows, which makes it infeasible to run.

I made a simple test showing what I'm doing. I make a dictionary where I do a bunch of local processing (roughly 1 million entries), then batch insert that into the SQLite DB, then loop and do it all again. Here's the code:

from collections import defaultdict
import sqlite3
import datetime
import random

def log(s):
    now = datetime.datetime.now()
    print(str(now) + ": " + str(s))

def create_table():
    conn = create_connection()
    with conn:
        cursor = conn.cursor()

        sql = """
            CREATE TABLE IF NOT EXISTS testing (
                test text PRIMARY KEY,
                number integer
            );"""
        cursor.execute(sql)
    conn.close()

def insert_many(local_db):
    sql = """INSERT INTO testing(test, number) VALUES(?, ?) ON CONFLICT(test) DO UPDATE SET number=number+?;"""

    inserts = []
    for key, value in local_db.items():
        inserts.append((key, value, value))

    conn = create_connection()
    with conn:
        cursor = conn.cursor()
        cursor.executemany(sql, inserts)
    conn.close()

def main():
    i = 0
    log("Starting to process records")
    for i in range(1, 21):
        local_db = defaultdict(int)
        for j in range(0, 1000000):
            s = "Testing insertion " + str(random.randrange(100000000))
            local_db[s] += 1
        log("Created local DB for " + str(1000000 * i) + " records")
        insert_many(local_db)
        log("Finished inserting " + str(1000000 * i) + " records")

def create_connection():
    conn = None
    try:
        conn = sqlite3.connect('/home/testing.db')
    except sqlite3.Error as e:
        print(e)

    return conn

if __name__ == '__main__':
    create_table()
    main()

This runs great for a second, then slows down like crazy. Here's the output I just got:

2019-10-23 15:28:59.211036: Starting to process records
2019-10-23 15:29:01.308668: Created local DB for 1000000 records
2019-10-23 15:29:10.147762: Finished inserting 1000000 records
2019-10-23 15:29:12.258012: Created local DB for 2000000 records
2019-10-23 15:29:28.752352: Finished inserting 2000000 records
2019-10-23 15:29:30.853128: Created local DB for 3000000 records
2019-10-23 15:39:12.826357: Finished inserting 3000000 records
2019-10-23 15:39:14.932100: Created local DB for 4000000 records
2019-10-23 17:21:37.257651: Finished inserting 4000000 records
...

As you can see, the first million inserts take 9 seconds, the next million take 16 seconds, then it balloons to about 10 minutes, and then to an hour and 40 minutes (!). Is there something weird I'm doing that causes this slowdown, or is this a limitation of SQLite?

Upvotes: 3

Views: 1201

Answers (2)

peak

Reputation: 116690

Using your program with one (minor?) modification, I got the very reasonable-looking timings shown immediately below. The modification was to use sqlite3.connect instead of pysqlite.connect.

Timings using sqlite3.connect

The # comments are approximations.

2019-10-23 13:00:37.843759: Starting to process records
2019-10-23 13:00:40.253049: Created local DB for 1000000 records
2019-10-23 13:00:50.052383: Finished inserting 1000000 records          # 12s
2019-10-23 13:00:52.065007: Created local DB for 2000000 records
2019-10-23 13:01:08.069532: Finished inserting 2000000 records          # 18s
2019-10-23 13:01:10.073701: Created local DB for 3000000 records
2019-10-23 13:01:28.233935: Finished inserting 3000000 records          # 20s
2019-10-23 13:01:30.237968: Created local DB for 4000000 records
2019-10-23 13:01:51.052647: Finished inserting 4000000 records          # 23s
2019-10-23 13:01:53.079311: Created local DB for 5000000 records
2019-10-23 13:02:15.087708: Finished inserting 5000000 records          # 24s
2019-10-23 13:02:17.075652: Created local DB for 6000000 records
2019-10-23 13:02:41.710617: Finished inserting 6000000 records          # 26s
2019-10-23 13:02:43.712996: Created local DB for 7000000 records
2019-10-23 13:03:18.420790: Finished inserting 7000000 records          # 37s
2019-10-23 13:03:20.420485: Created local DB for 8000000 records
2019-10-23 13:04:03.287034: Finished inserting 8000000 records          # 45s
2019-10-23 13:04:05.593073: Created local DB for 9000000 records
2019-10-23 13:04:57.871396: Finished inserting 9000000 records          # 54s
2019-10-23 13:04:59.860289: Created local DB for 10000000 records
2019-10-23 13:05:54.527094: Finished inserting 10000000 records         # 57s
...

The costs of TEXT PRIMARY KEY

I believe the main reason for the slowdown is the definition of test as TEXT PRIMARY KEY. That entails huge indexing costs, as suggested by this snippet from a run in which the "PRIMARY KEY" and "ON CONFLICT" declarations were removed:

2019-10-23 13:26:22.627898: Created local DB for 10000000 records
2019-10-23 13:26:24.010171: Finished inserting 10000000 records
...
2019-10-23 13:26:58.350150: Created local DB for 20000000 records
2019-10-23 13:26:59.832137: Finished inserting 20000000 records

That's less than 1.4s at the 10-million-records mark, and not much more at the 20-million-records mark.
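
For reference, here is a minimal sketch of what that modified schema and insert might look like, assuming the only changes are dropping the PRIMARY KEY constraint and the ON CONFLICT clause. Rows are then simply appended with no index to maintain, so duplicate keys would have to be merged later (e.g. with GROUP BY):

def create_table():
    conn = create_connection()
    with conn:
        conn.execute("""
            CREATE TABLE IF NOT EXISTS testing (
                test text,       -- no PRIMARY KEY: no index is maintained on insert
                number integer
            );""")
    conn.close()

def insert_many(local_db):
    # Plain appends: no uniqueness check and no upsert logic.
    sql = "INSERT INTO testing(test, number) VALUES(?, ?);"
    conn = create_connection()
    with conn:
        conn.executemany(sql, [(k, v) for k, v in local_db.items()])
    conn.close()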

Upvotes: 2

Marat

Reputation: 15738

(More of an extended comment than an answer)

SQLite only supports B-tree indexes. For strings, which may have different lengths, the tree stores row ids. The read complexity of a B-tree is O(log(n)), where n is the number of rows in the table, but each step is multiplied by the cost of reading and comparing a string value from the table. So, unless there is a really good reason to do otherwise, it is better to use an integer field as the primary key.

What makes the situation worse in this case is that the strings you're inserting share a fairly long prefix ("Testing insertion "), so finding the first mismatching character takes longer.

Speedup suggestions, in order of expected effect size:

  • Real databases (MariaDB, Postgres) support hash indexes, which would solve this problem.
  • disable autocommit and commit the whole batch in one transaction (skips unnecessary, very expensive disk writes; see the sketch below)
  • reverse the text string (number before the fixed text), or keep only the number part
  • use mass insertions (multiple records in one statement)
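
As a rough illustration of the autocommit and key suggestions, here is a sketch of how the question's insert_many could look if the key were just the integer part of the string (whether that is acceptable depends on the actual data) and the whole batch were wrapped in one explicit transaction; the table and column names are kept from the question:

import sqlite3

def insert_many(local_db):
    # local_db is assumed here to map int -> count, e.g. {4217: 3, ...}
    sql = """INSERT INTO testing(test, number) VALUES(?, ?)
             ON CONFLICT(test) DO UPDATE SET number = number + ?;"""
    conn = sqlite3.connect('/home/testing.db')
    conn.isolation_level = None                  # manual transaction control
    conn.execute("""
        CREATE TABLE IF NOT EXISTS testing (
            test integer PRIMARY KEY,            -- integer key doubles as the rowid
            number integer
        );""")
    conn.execute("BEGIN")                        # one transaction for the whole batch
    conn.executemany(sql, [(k, v, v) for k, v in local_db.items()])
    conn.execute("COMMIT")
    conn.close()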

@peak's answer avoids the whole problem by not using an index at all. If an index is not required, this is definitely the way to go.

Upvotes: 2
