Cort Ammon
Cort Ammon

Reputation: 10893

Search for many rows with many distinct values

I am looking to implement a cache in MySQL. I'm doing this because the app I'm developing needs a cache but has no persistence.

I am keeping a collection of SHA-1 checksums for files. The user can provide me with a list of checksums, and I need to return to them which checksums I do not have in my database. I'm expecting the number of checksums in my database to be on the order of a million, and the number of values to check to be on the order of a hundred thousand.

The naive approach would be to issue a long string of

SELECT COUNT(*) FROM myTable WHERE checksum = '2fd4e1c67a2d28fced849ee1bb76e7391b93eb12';

This could be optimized somewhat by using the ability to use parameters with the Python connector. However, this seems like a lot of churn across the connection between the python process and the MySQL process.

I know there is an approach along the lines of SELECT checksum FROM myTable NOT IN (SELECT someOtherTable WHERE someClause), which is a match for what I need except that the list of checksums to check against is in my Python process, not a MySQL table.

What is the best way to find checksums that are not in a list which is initially only available in my Python process (so this must be shared across to MySQL across IPC)?

Upvotes: 0

Views: 62

Answers (3)

Gordon Linoff
Gordon Linoff

Reputation: 1270463

I need to return to them which checksums I do not have in my database.

You would basically do:

select c.checksum
from (select ? as checksum union all
      select ? as checksum union all
      . . .
     ) c 
where not exists (select 1
                  from mytable t
                  where t.checksum = c.checksum
                 );

The ?s are for parameter placeholders. If the user provides a finite but not too large number, you can use this.

If the user is providing a bunch, you may want to construct the query as dynamic SQL.

If the "user" is really getting them from the database, then you should use the same logic and keep the data in the database. That is, the c derived table would be a subquery.

Finally, if they are coming from a file, you might want to load that into a temporary table for the logic.

Upvotes: 0

Himanshu
Himanshu

Reputation: 3970

Not sure what you are exactly asking. As far as i could get somewhere around yohr query is optimization. Better option is use NOT EXISTS and have an index on the unique or primary key of your table for fast retrieval.

Indexing the table gives the order to search from. Hence, faster retrieval.

  SELECT checksum FROM myTable t
   Where not exists(SELECT 1 
      From someOtherTable 
    WHERE someClause=t.someid),

Upvotes: 1

Akina
Akina

Reputation: 42728

Your table must have an index by checksum field.

When user provides a list of checksums you must create temporary table, store these values into it, build the index by shecksum field and search using sommon simple query with JOIN.

The fastest method to insert the values list to be searched for is to save them into texst file and load into temporary table using LOAD DATA INFILE. Applicable if client have enough rights to create a file in the MySQL's server filesystem.

If it is impossible then insert the values by chunks.

Upvotes: 0

Related Questions