I am building an inverted index lookup table for my database in sqlite3. The database consists of a number of bloggers and their posts.
I have a table Post with the columns id, text, creation_date, blogger_id. This table contains ~680 000 posts. And I want to build a table Blogger_Post_Word with the columns blogger_id, post_id, word_position, word_id.
I am using Python for this and I have tried two approaches so far, but both have their problems.
I read online that the best way to insert large amounts of data is with a bulk insert. That means I have to fetch all the posts and, for every word in every post, store a row locally so I can do one bulk insert later. This requires far more memory than I have.
I have also tried inserting each word one by one, but that just takes way too long.
Is there an efficient way to solve this problem, or an SQL statement that does this in one go?
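For reference, the bulk-insert pattern I mean looks roughly like this (just a sketch, not code I can actually run, since indexed_data does not fit in my memory; 'blog.db' is a placeholder path):

import sqlite3

conn = sqlite3.connect('blog.db')   # placeholder path
c = conn.cursor()

# build the whole index in memory first ...
indexed_data = []
for post_id, text, creation_date, blogger_id in c.execute('SELECT * FROM Post').fetchall():
    for word_position, word in enumerate(text.split(' ')):
        word_id = get_word_id(word)  # word -> Word.id lookup, see my code in the edit below
        indexed_data.append((blogger_id, post_id, word_position, word_id))

# ... then insert everything in one go
c.executemany(
    'INSERT INTO Blogger_Post_Word (blogger_id, post_id, word_position, word_id) VALUES (?, ?, ?, ?)',
    indexed_data)
conn.commit()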
Edit: This is the code I'm using now:
from functools import lru_cache

@lru_cache()
def get_word_id(word: str) -> int:
    # look the word up in the Word table, inserting it first if it is new
    word_w_id = db.get_one('Word', ['word'], (word,))
    if word_w_id is None:
        db.insert_one('Word', ['word'], (word,))
        word_w_id = db.get_one('Word', ['word'], (word,))
    return word_w_id[0]


for post_id, text, creation_date, blogger_id in db.get_all('Post'):
    split_text = text.split(' ')
    for word_position, word in enumerate(split_text):
        word_id = get_word_id(word)
        db.insert_one('Blogger_Post_Word',
                      ['blogger_id', 'post_id', 'word_position', 'word_id'],
                      (blogger_id, post_id, word_position, word_id))
The db object is an instance of a class I wrote to handle the database; these are the functions from that class that I use:
def get(self, table: str, where_cols: list = None, where_vals: tuple = None):
    query = 'SELECT * FROM ' + table
    if where_cols is not None and where_vals is not None:
        where_cols = [w + '=?' for w in where_cols]
        query += ' WHERE ' + ' and '.join(where_cols)
        return self.c.execute(query, where_vals)
    return self.c.execute(query)

def get_one(self, table: str, where_cols: list = None, where_vals: tuple = None):
    self.get(table, where_cols, where_vals)
    return self.c.fetchone()

def insert_one(self, table: str, columns: list, values: tuple):
    # note: this commits after every single row
    query = self.to_insert_query(table, columns)
    self.c.execute(query, values)
    self.conn.commit()

def to_insert_query(self, table: str, columns: list):
    return 'INSERT INTO ' + table + '(' + ','.join(columns) + ')' + ' VALUES (' + ','.join(['?' for i in columns]) + ')'
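For reference, a batched counterpart to insert_one in the same style would look something like this hypothetical insert_many (it is not part of my class, just a sketch of replacing the per-row commit with a single executemany):

def insert_many(self, table: str, columns: list, rows: list):
    # hypothetical helper: one executemany and a single commit,
    # instead of a commit for every row like insert_one does
    query = self.to_insert_query(table, columns)
    self.c.executemany(query, rows)
    self.conn.commit()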
Okay, I hope this helps someone.
The problem was indeed that inserting one row at a time is too slow, and that I don't have enough memory to store the whole list locally.
Instead I used a hybrid of the two approaches and inserted the data into the database incrementally.
I printed the size of my list to find the bottleneck: around 150 000 of the 680 000 posts was roughly where I ran out of memory, and at that point the list was about 4.5 GB.
from pympler.asizeof import asizeof
print(asizeof(indexed_data))
>>> 4590991936
I decided on an increment of 50 000 posts to keep everything running smoothly.
This is now my code:
# get all posts
c.execute('SELECT * FROM Post')
all_posts = c.fetchall()

increment = 50000
start = 0
end = increment

while start < len(all_posts):
    indexed_data = []
    print(start, ' -> ', end)
    for post_id, text, creation_date, blogger_id in all_posts[start:end]:
        split_text = text.split(' ')
        # for each word in the post add a tuple with blogger id, post id, word position in the post and the word to indexed_data
        indexed_data.extend([(blogger_id, post_id, word_position, word)
                             for word_position, word in enumerate(split_text)])
    print('saving...')
    c.executemany('''
        INSERT INTO Inverted_index (blogger_id, post_id, word_position, word)
        VALUES (?, ?, ?, ?)
    ''', indexed_data)
    start += increment
    if end + increment > len(all_posts):
        end = len(all_posts)
    else:
        end += increment
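The fetchall() of all 680 000 posts could also be avoided by paging over the Post table by primary key, so only one batch of posts is ever in memory at a time. This is just a sketch of that variant, not the code I actually ran; it assumes conn is the connection that the cursor c above came from and that id is Post's integer primary key:

increment = 50000
last_id = 0

while True:
    # fetch the next batch of posts by primary key instead of slicing a full fetchall()
    c.execute('SELECT * FROM Post WHERE id > ? ORDER BY id LIMIT ?', (last_id, increment))
    batch = c.fetchall()
    if not batch:
        break

    indexed_data = []
    for post_id, text, creation_date, blogger_id in batch:
        for word_position, word in enumerate(text.split(' ')):
            indexed_data.append((blogger_id, post_id, word_position, word))

    c.executemany('''
        INSERT INTO Inverted_index (blogger_id, post_id, word_position, word)
        VALUES (?, ?, ?, ?)
    ''', indexed_data)
    conn.commit()            # one commit per batch (conn is assumed to be c's connection)

    last_id = batch[-1][0]   # id is the first column of Post

Paging with WHERE id > ? ORDER BY id keeps each SELECT cheap, whereas LIMIT/OFFSET would re-scan all the skipped rows on every batch.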