Reputation: 1720
I have a few text files that contain URLs. I am trying to create a SQLite database to store these URLs in a table. The URL table has two columns, i.e. a primary key (INTEGER) and the URL (TEXT).
I try to insert 100,000 entries in one insert command and loop until I finish the URL list. Basically, I read the content of all the text files, save it in a list, and then build smaller lists of 100,000 entries each and insert them into the table.
The text files contain 4,591,415 URLs in total, and the total text file size is approximately 97.5 MB.
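For reference, a minimal sketch of that approach (the glob pattern, database file name, and the use of INSERT OR IGNORE are assumptions, not the actual code, which is linked below):

import glob
import sqlite3

# Minimal sketch of the approach described above. The glob pattern and the
# database file name are assumptions; the schema matches the CREATE TABLE
# statement shown further down in this question.
conn = sqlite3.connect('blacklist.db')
conn.execute('CREATE TABLE IF NOT EXISTS blacklist '
             '(id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT, url TEXT NOT NULL UNIQUE)')

urls = []
for path in glob.glob('urls_*.txt'):  # hypothetical input file names
    with open(path, encoding='utf-8') as f:
        urls.extend(line.strip() for line in f if line.strip())

batch_size = 100000
for start in range(0, len(urls), batch_size):
    batch = urls[start:start + batch_size]
    # OR IGNORE skips duplicates that would violate the UNIQUE constraint
    conn.executemany('INSERT OR IGNORE INTO blacklist(url) VALUES (?)',
                     ((u,) for u in batch))
    conn.commit()
conn.close()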
Problems:
When I choose a file database, it takes around 7-7.5 minutes to insert. I feel this is not a very fast insert, given that I have a solid-state drive, which has faster reads/writes. Along with that, I have approximately 10 GB of RAM available as seen in Task Manager. The processor is an i5-6300U at 2.4 GHz.
The total text file size is approximately 97.5 MB, but after I insert the URLs into SQLite, the database is approximately 350 MB, i.e. almost 3.5 times the original data size. Since the database doesn't contain any other tables, indexes, etc., this database size looks a little odd.
For problem 1, I tried playing with the parameters and came up with the best ones based on test runs with different settings (a sketch of how these settings are applied from Python follows the table).
Configuration | Time
--- | ---
50,000 - with journal = delete and no transaction | 0:12:09.888404
50,000 - with journal = delete and with transaction | 0:22:43.613580
50,000 - with journal = memory and transaction | 0:09:01.140017
50,000 - with journal = memory | 0:07:38.820148
50,000 - with journal = memory and synchronous=0 | 0:07:43.587135
50,000 - with journal = memory and synchronous=1 and page_size=65535 | 0:07:19.778217
50,000 - with journal = memory and synchronous=0 and page_size=65535 | 0:07:28.186541
50,000 - with journal = delete and synchronous=1 and page_size=65535 | 0:07:06.539198
50,000 - with journal = delete and synchronous=0 and page_size=65535 | 0:07:19.810333
50,000 - with journal = wal and synchronous=0 and page_size=65535 | 0:08:22.856690
50,000 - with journal = wal and synchronous=1 and page_size=65535 | 0:08:22.326936
50,000 - with journal = delete and synchronous=1 and page_size=4096 | 0:07:35.365883
50,000 - with journal = memory and synchronous=1 and page_size=4096 | 0:07:15.183948
100,000 - with journal = delete and synchronous=1 and page_size=65535 | 0:07:13.402985
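A rough sketch of how these settings might be applied with Python's sqlite3 module (the database file name is an assumption):

import sqlite3

conn = sqlite3.connect('blacklist.db')
# page_size only takes effect on a brand-new database or after VACUUM; SQLite
# accepts only power-of-two values from 512 to 65536, so 65535 is silently ignored.
conn.execute('PRAGMA page_size = 65536')
conn.execute('PRAGMA journal_mode = MEMORY')  # or DELETE / WAL, as in the table above
conn.execute('PRAGMA synchronous = 1')        # 1 = NORMAL, 0 = OFF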
I was checking online and found this link https://adamyork.com/2017/07/02/fast-database-inserts-with-python-3-6-and-sqlite/ where the system is much slower than mine but still performs very well. Two things stood out from that link.
I have shared the python code and the files here: https://github.com/ksinghgithub/python_sqlite
Can someone guide me on optimizing this code? Thanks.
Environment:
Edit 1: New performance chart based on the feedback received about the UNIQUE constraint and on my experimenting with the cache_size value (see the sketch after the table below).
self.db.execute('CREATE TABLE blacklist (id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT, url TEXT NOT NULL UNIQUE)')
Configuration | Action | Time | Notes
--- | --- | --- | ---
50,000 - with journal = delete and synchronous=1 and page_size=65535 and cache_size=8192 | REMOVE UNIQUE FROM URL | 0:00:18.011823 | Size reduced to 196MB from 350MB
50,000 - with journal = delete and synchronous=1 and page_size=65535 and cache_size=default | REMOVE UNIQUE FROM URL | 0:00:25.692283 | Size reduced to 196MB from 350MB
100,000 - with journal = delete and synchronous=1 and page_size=65535 | | 0:07:13.402985 |
100,000 - with journal = delete and synchronous=1 and page_size=65535 and cache_size=4096 | | 0:04:47.624909 |
100,000 - with journal = delete and synchronous=1 and page_size=65535 and cache_size=8192 | | 0:03:32.473927 |
100,000 - with journal = delete and synchronous=1 and page_size=65535 and cache_size=8192 | REMOVE UNIQUE FROM URL | 0:00:17.927050 | Size reduced to 196MB from 350MB
100,000 - with journal = delete and synchronous=1 and page_size=65535 and cache_size=default | REMOVE UNIQUE FROM URL | 0:00:21.804679 | Size reduced to 196MB from 350MB
100,000 - with journal = delete and synchronous=1 and page_size=65535 and cache_size=default | REMOVE UNIQUE FROM URL & ID | 0:00:14.062386 | Size reduced to 134MB from 350MB
100,000 - with journal = delete and synchronous=1 and page_size=65535 and cache_size=default | REMOVE UNIQUE FROM URL & DELETE ID | 0:00:11.961004 | Size reduced to 134MB from 350MB
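A sketch of the "Edit 1" variant (database file name assumed; cache_size is measured in pages, a negative value means KiB):

import sqlite3

conn = sqlite3.connect('blacklist.db')
conn.execute('PRAGMA cache_size = 8192')  # in pages; e.g. -8192 would mean 8192 KiB instead
# Variant without the UNIQUE constraint on url (the last rows also drop the id column entirely)
conn.execute('CREATE TABLE IF NOT EXISTS blacklist '
             '(id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT, url TEXT NOT NULL)')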
Upvotes: 1
Views: 1443
Reputation: 2776
The UNIQUE constraint on the "url" column creates an implicit index on the URL. That would explain the size increase.
I don't think you can populate the table first and add the unique constraint afterwards.
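One quick way to see that implicit index (assuming the database file is blacklist.db):

import sqlite3

conn = sqlite3.connect('blacklist.db')
# Implicit UNIQUE indexes show up as automatic "sqlite_autoindex_..." entries with no SQL text.
for name, sql in conn.execute(
        "SELECT name, sql FROM sqlite_master WHERE type = 'index' AND tbl_name = 'blacklist'"):
    print(name, sql)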
Your bottleneck is surely the CPU. Try the following:
pip install toolz
Use this method:
import sqlite3
from datetime import datetime

from toolz import partition_all

def add_blacklist_url(self, urls):
    # print('add_blacklist_url:: entries = {}'.format(len(urls)))
    start_time = datetime.now()
    for batch in partition_all(100000, urls):
        try:
            start_commit = datetime.now()
            self.cursor.executemany('''INSERT OR IGNORE INTO blacklist(url) VALUES(:url)''', batch)
            end_commit = datetime.now() - start_commit
            print('add_blacklist_url:: total time for INSERT OR IGNORE INTO blacklist {} entries = {}'.format(len(batch), end_commit))
        except sqlite3.Error as e:
            print("add_blacklist_url:: Database error: %s" % e)
        except Exception as e:
            print("add_blacklist_url:: Exception in _query: %s" % e)
    self.db.commit()
    time_elapsed = datetime.now() - start_time
    print('add_blacklist_url:: total time for {} entries = {}'.format(len(urls), time_elapsed))
The code was not tested.
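Note that the :url named placeholder means every element of urls must be a mapping with an 'url' key. A hypothetical call (store stands for an instance of your class, and the file name is a placeholder):

# 'urls.txt' and 'store' are placeholders for your own file and class instance.
with open('urls.txt', encoding='utf-8') as f:
    urls = [{'url': line.strip()} for line in f if line.strip()]
store.add_blacklist_url(urls)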
Upvotes: 0
Reputation: 74655
SQLite uses auto-commit mode by default. This permits BEGIN TRANSACTION to be omitted. But here we want all the inserts to be in a transaction, and the only way to do that is to start one with BEGIN TRANSACTION so that all the statements that are going to be run are in that transaction.
The executemany method is only a loop over execute done outside Python, and it calls the SQLite prepare-statement function only once.
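A minimal sketch combining both points, assuming the table from the question already exists (the database name and URLs are placeholders):

import sqlite3

# isolation_level=None puts the connection in autocommit mode, so we control
# the transaction explicitly with BEGIN/COMMIT around a single executemany call.
conn = sqlite3.connect('blacklist.db', isolation_level=None)
cur = conn.cursor()
cur.execute('BEGIN')
cur.executemany('INSERT OR IGNORE INTO blacklist(url) VALUES (?)',
                [('http://example.com/a',), ('http://example.com/b',)])
cur.execute('COMMIT')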
The following is a really bad way to remove the last N items from a list:
templist = []
i = 0
while i < self.bulk_insert_entries and len(urls) > 0:
    templist.append(urls.pop())
    i += 1
It is better to do this:
templist = urls[-self.bulk_insert_entries:]
del urls[-self.bulk_insert_entries:]
i = len(templist)
The slice and del slice work even on an empty list.
Both might have the same complexity, but 100K calls to append and pop cost a lot more than letting Python do the work outside the interpreter.
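A quick, unscientific way to compare the two on synthetic data with timeit:

import timeit

setup = "urls = list(range(1000000)); n = 100000"

pop_loop = """
templist = []
i = 0
while i < n and len(urls) > 0:
    templist.append(urls.pop())
    i += 1
"""

slice_del = """
templist = urls[-n:]
del urls[-n:]
"""

# Each statement is run 10 times against the same shrinking list, so both
# variants end up moving the same one million items in total.
print('pop loop :', timeit.timeit(pop_loop, setup=setup, number=10))
print('slice/del:', timeit.timeit(slice_del, setup=setup, number=10))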
Upvotes: 1