Reputation: 2753
I have a database in SQLite I use to store hashes. Often times I will need query upwards of a hundred thousand hashes against the dataset of approximately 20 million hashes. There are several tables in the database containing these 20 some ought millions of hashes. Each table has the same structure.
For example,
ID,md5,sha1,source
0,abcd1234...,abcd4321...,bills_hashes
1,cdef5678...,cdef8765...,bobs_hashes
Currently, my code (Python) reads in the input text file and creates a list of comparison hashes. The hashes in the list are then created into a set/list (forgot what it's technically called in SQL). Then for each table in the database a query is created akin to the following and added to a list.
SELECT * FROM zacks_table WHERE md5 COLLATE NOCASE in ('hash1', 'hash2', ... 'hash156921');
SELECT * FROM gregs_table WHERE md5 COLLATE NOCASE in ('hash1', 'hash2', ... 'hash156921');
There is usually about 12 tables so about 12 queries are added to a list. Then all 12 queries are executed, and all the resulting return rows are collated and then processed.
The issue I'm having is that the querying / result gathering is taking a long time with the larger data sets. Once I get over 50,000 hashes it seems to take hours sometimes. The hash columns in the database are indexed.
How can I improve my queries to speed up my returns so I can process this data quickly (as can be)? I am using the sqlite3 module and Python 2.7 to perform these operations.
Upvotes: 0
Views: 701
Reputation: 27321
This seems to be a table structure problem more than a query problem.
First, table names like zacks_table
and gregs_table
usually indicate that there really should be a single table with username
as a column.
If your input is ~100K hashes I would first insert them into a temporary table, e.g.:
create temp table inputhash (hash varchar primary key)
and then insert all your input hashes in one go:
c.execute("insert into inputhash (hash) values (?)", [(hash1,), (hash2,),...])
Now you can get the result with a much smaller select:
select *
from zacks_table zack
join inputhash inpt on zack.md5 = inpt.hash
if you've created a data table with username as a column you could do
select *
from data_table data
join inputhash inpt on data.md5 = inpt.hash
and data.username in ('zack', 'greg')
Upvotes: 2