user13518301

Reputation: 21

Inserting large amounts of data in sqlite

I am making an inverted index lookup table for my database in sqlite3. The database I have consists of certain bloggers and their posts.

I have a table post which has the columns id, text, blogger_id. This table consists of ~680 000 posts. And I want to make a table Blogger_Post_Word with the columns blogger_id, post_id, word_position, word_id.
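For concreteness, the target table could be created roughly like this (a sketch only; the column types and the database path are assumptions, since the question only names the columns):

import sqlite3

conn = sqlite3.connect('blog.db')  # path is an assumption
c = conn.cursor()

# Rough shape of the target table; the question only names the columns,
# so the INTEGER types here are assumptions.
c.execute('''
CREATE TABLE IF NOT EXISTS Blogger_Post_Word (
    blogger_id    INTEGER,
    post_id       INTEGER,
    word_position INTEGER,
    word_id       INTEGER
)
''')
conn.commit()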

I am using Python for this, and I have tried two approaches before, but both have their problems.

Is there an efficient way to solve this problem or an sql statement that does this in one go?

Edit: This is the code I'm using now:

from functools import lru_cache

@lru_cache()
def get_word_id(word: str) -> int:
    # Look the word up in the Word table, inserting it first if it is not there yet,
    # and return its id (the first column of the row).
    word_w_id = db.get_one('Word', ['word'], (word,))
    if word_w_id is None:
        db.insert_one('Word', ['word'], (word,))
        word_w_id = db.get_one('Word', ['word'], (word,))
    return word_w_id[0]

for post_id, text, creation_date, blogger_id in db.get_all('Post'):
    split_text = text.split(' ')

    for word_position, word in enumerate(split_text):
        word_id = get_word_id(word)

        db.insert_one('Blogger_Post_Word',
                      ['blogger_id', 'post_id', 'word_position', 'word_id'],
                      (blogger_id, post_id, word_position, word_id))
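As an aside, the word lookup above runs a SELECT, an INSERT and then a second SELECT for every new word. A minimal sketch that skips the second SELECT by reading cursor.lastrowid after the insert could look like this (it assumes the Word table has an INTEGER PRIMARY KEY column named id and that c is the underlying sqlite3 cursor, neither of which is shown in the question; get_word_id_fast is just an illustrative name):

from functools import lru_cache

@lru_cache(maxsize=None)
def get_word_id_fast(word: str) -> int:
    # Reuse an existing row if the word is already known.
    row = c.execute('SELECT id FROM Word WHERE word = ?', (word,)).fetchone()
    if row is not None:
        return row[0]
    # Otherwise insert it and read the generated key directly,
    # avoiding a second SELECT round trip.
    c.execute('INSERT INTO Word (word) VALUES (?)', (word,))
    return c.lastrowid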

The db object is an instance of a class I wrote to handle the database; these are the methods from that class that I use:

def get(self, table: str, where_cols: list = None, where_vals: tuple = None):
    query = 'SELECT * FROM ' + table

    if where_cols is not None and where_vals is not None:
        where_cols = [w + '=?' for w in where_cols]
        query += ' WHERE ' + ' and '.join(where_cols)

        return self.c.execute(query, where_vals)

    return self.c.execute(query)

def get_one(self, table: str, where_cols: list = None, where_vals: tuple = None):
    self.get(table, where_cols, where_vals)
    return self.c.fetchone()



def insert_one(self, table: str, columns: list, values: tuple):
    query = self.to_insert_query(table, columns)
    self.c.execute(query, values)
    self.conn.commit()


def to_insert_query(self, table: str, columns: list):
    return 'INSERT INTO ' + table + '(' + ','.join(columns) + ')' + ' VALUES (' + ','.join(['?' for i in columns]) + ')'
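The main cost in the approach above is that insert_one runs every row as its own transaction: each conn.commit() forces SQLite to sync to disk. A batched companion method along these lines (a sketch; insert_many is not part of the original class, the name is made up here) would let a single commit cover a whole batch of rows:

def insert_many(self, table: str, columns: list, rows: list):
    # Same parameterized INSERT as insert_one, but executed for all rows
    # in one call and committed once, so only one transaction is paid for.
    query = self.to_insert_query(table, columns)
    self.c.executemany(query, rows)
    self.conn.commit()

The inner loop can then collect its (blogger_id, post_id, word_position, word_id) tuples into a list and pass them to insert_many per batch of posts, which is essentially what the answer below ends up doing with executemany.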

Upvotes: 1

Views: 1055

Answers (1)

user13518301

Reputation: 21

Okay, I hope this helps someone.

The problem was indeed that inserting one row at a time is too slow, and that I didn't have enough memory to store the whole list at once.

Instead I used a hybrid of the two approaches and inserted the rows into the database incrementally.

I printed the size of my list to find the memory bottleneck. Around 150 000 of the 680 000 posts turned out to be roughly my limit; at that point the list was about 4.5 GB.

from pympler.asizeof import asizeof
print(asizeof(indexed_data))  # total in-memory size of the list, in bytes

>>> 4590991936

I decided on an increment of 50 000 posts to keep everything running smoothly.

This is now my code:

# get all posts
c.execute('SELECT * FROM Post')
all_posts = c.fetchall()

increment = 50000

start = 0
end = increment

while start < len(all_posts):
    indexed_data = []

    print(start, ' -> ', end)

    for post_id, text, creation_date, blogger_id in all_posts[start:end]:
        split_text = text.split(' ')

        # for each word in the post add a tuple with blogger id, post id, word position in the post and the word to indexed_data
        indexed_data.extend([(blogger_id, post_id, word_position, word) for word_position, word in enumerate(split_text)])

    print('saving...')
    c.executemany('''
    INSERT INTO Inverted_index (blogger_id, post_id, word_position, word)
    VALUES (?, ?, ?, ?)
    ''', indexed_data)
    # Commit each batch (conn is the sqlite3 connection the cursor c was created from).
    conn.commit()

    start += increment
    if end + increment > len(all_posts):
        end = len(all_posts)
    else:
        end += increment
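A further refinement of the same idea, if fetchall() on all 680 000 posts ever becomes a memory problem itself, would be to stream the posts in chunks with fetchmany instead of slicing one big list. A sketch under the same schema assumptions, where conn is the sqlite3 connection the cursor was created from:

read_cur = conn.cursor()   # one cursor for reading posts...
write_cur = conn.cursor()  # ...and a separate one for the inserts
read_cur.execute('SELECT * FROM Post')

while True:
    chunk = read_cur.fetchmany(50000)
    if not chunk:
        break
    # Build the index rows for this chunk only, so at most one chunk of posts
    # and its index rows are held in memory at a time.
    indexed_data = [(blogger_id, post_id, word_position, word)
                    for post_id, text, creation_date, blogger_id in chunk
                    for word_position, word in enumerate(text.split(' '))]
    write_cur.executemany('''
    INSERT INTO Inverted_index (blogger_id, post_id, word_position, word)
    VALUES (?, ?, ?, ?)
    ''', indexed_data)
    conn.commit()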

Upvotes: 1
