0xhughes
0xhughes

Reputation: 2753

What is the most efficient way to query multiple values from a single column in SQLite?

I have a database in SQLite I use to store hashes. Often times I will need query upwards of a hundred thousand hashes against the dataset of approximately 20 million hashes. There are several tables in the database containing these 20 some ought millions of hashes. Each table has the same structure.

For example,

ID,md5,sha1,source

0,abcd1234...,abcd4321...,bills_hashes

1,cdef5678...,cdef8765...,bobs_hashes

Currently, my code (Python) reads in the input text file and creates a list of comparison hashes. The hashes in the list are then created into a set/list (forgot what it's technically called in SQL). Then for each table in the database a query is created akin to the following and added to a list.

SELECT * FROM zacks_table WHERE md5 COLLATE NOCASE in ('hash1', 'hash2', ... 'hash156921');

SELECT * FROM gregs_table WHERE md5 COLLATE NOCASE in ('hash1', 'hash2', ... 'hash156921');

There is usually about 12 tables so about 12 queries are added to a list. Then all 12 queries are executed, and all the resulting return rows are collated and then processed.

The issue I'm having is that the querying / result gathering is taking a long time with the larger data sets. Once I get over 50,000 hashes it seems to take hours sometimes. The hash columns in the database are indexed.

How can I improve my queries to speed up my returns so I can process this data quickly (as can be)? I am using the sqlite3 module and Python 2.7 to perform these operations.

Upvotes: 0

Views: 701

Answers (1)

thebjorn
thebjorn

Reputation: 27321

This seems to be a table structure problem more than a query problem.

First, table names like zacks_table and gregs_table usually indicate that there really should be a single table with username as a column.

If your input is ~100K hashes I would first insert them into a temporary table, e.g.:

create temp table inputhash (hash varchar primary key)

and then insert all your input hashes in one go:

c.execute("insert into inputhash (hash) values (?)", [(hash1,), (hash2,),...])

Now you can get the result with a much smaller select:

select * 
from zacks_table zack
join inputhash inpt on zack.md5 = inpt.hash

if you've created a data table with username as a column you could do

select * 
from data_table data
join inputhash inpt on data.md5 = inpt.hash 
 and data.username in ('zack', 'greg')

Upvotes: 2

Related Questions