amba88

Reputation: 789

MySQL Python taking too long to query large database

I have a database with over 30,000 tables and ~40-100 rows in each table. I want to retrieve a list of table names which contain a string under a specific column.

So for example:

I want to retrieve the names of all tables which contain 'foo'...

Database
    Table_1
        ID: 1, STR: bar
        ID: 2, STR: foo
        ID: 3, STR: bar
    Table_2
        ID: 1, STR: bar
        ID: 2, STR: bar
        ID: 3, STR: bar
    Table_3
        ID: 1, STR: bar
        ID: 2, STR: bar
        ID: 3, STR: foo

So in this case the function should return ['Table_1', 'Table_3']

So far I have this. It works, but it takes over two minutes to execute, which is far too long for the application I have in mind.

results = []
self.m('SHOW TABLES')
result = self.db.store_result()
tablelist = result.fetch_row(0, 1)  # all rows, returned as dicts
for table in tablelist:
    table_name = table['Tables_in_definitions']
    # (str here is the search string; note this interpolation is not
    # safe against SQL injection if the value is untrusted)
    self.m("SELECT `def` FROM `" + table_name + "` WHERE `def` = '" + str + "'")
    result = self.db.store_result()
    r = result.fetch_row(1, 1)
    if len(r) > 0:
        results.append(table_name)

I can't come up with a way to speed this up, so if anyone has any suggestions it would be greatly appreciated. Thanks!

Upvotes: 3

Views: 2711

Answers (1)

spencer7593

Reputation: 108380

If you are just testing for the existence of one row in each table where def = 'str', one easy thing to do (with no other changes) is to add a LIMIT 1 clause to the end of your query.

(If your query is performing a full table scan, MySQL can halt it once the first row is found. If no rows are found, the full table scan has to run to the end of the table.)

This also avoids overhead of preparing lots of rows to be returned to the client, and returning them to the client, if they aren't needed.

Also, an index with def as a leading column (at least on your largest tables) will likely help performance, if your query is looking through large tables for "a needle in haystack".
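As a minimal sketch of the LIMIT 1 suggestion, here's how the per-table query could be built (this mirrors the string-building style of the question's code; `build_existence_query` is an illustrative helper, not part of any API, and interpolating values into SQL like this is unsafe with untrusted input):

```python
def build_existence_query(table_name, search_str):
    # SELECT 1 ... LIMIT 1: we only care whether at least one row exists,
    # so MySQL can stop scanning at the first match instead of returning
    # every matching row to the client.
    return ("SELECT 1 FROM `%s` WHERE `def` = '%s' LIMIT 1"
            % (table_name, search_str))

print(build_existence_query("Table_1", "foo"))
```

An index with `def` as the leading column could be added with something like `ALTER TABLE Table_1 ADD INDEX (def);`, which turns each probe into an index lookup rather than a table scan.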


UPDATE:

I've re-read your question, and I see that you have 30,000 tables to check, that's 30,000 separate queries, 30,000 roundtrips to the database. (ACCCKKK.)

So my previous suggestion is pretty much useless. (That would be more appropriate with 40 tables each having 30,000 rows.)

Another approach would be to query a bunch of those tables at the same time. I'd be hesitant to even try more than a couple hundred tables at a time though, so I'd do it in batches.

SELECT DISTINCT 'Table1' AS table_name FROM Table1 WHERE def = 'str'
 UNION ALL
SELECT DISTINCT 'Table2' FROM Table2 WHERE def = 'str'
 UNION ALL
SELECT DISTINCT 'Table3' FROM Table3 WHERE def = 'str'

If def is unique in each table, or, if it's nearly unique, and you can handle duplicate table_name values being returned, you could get rid of the DISTINCT keyword.

You do need to ensure that every table in the list has a column named def; if a single table in the batch lacks that column, the whole batch query fails. SHOW TABLES doesn't check column names, so I'd use a query like this to get the list of table names that have a column named def:

SELECT table_name
  FROM information_schema.columns
 WHERE table_schema = DATABASE()
   AND column_name = 'def'
 GROUP BY table_name
 ORDER BY table_name
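Putting the pieces together, here is a hedged sketch of the batched approach in Python. The `run_query` helper is hypothetical: it stands in for whatever in your wrapper executes a query and returns rows as tuples (the `self.m(...)` / `store_result()` / `fetch_row()` calls in the question). As above, the string interpolation assumes the table names and search string are trusted.

```python
def build_batch_query(table_names, search_str):
    # One UNION ALL query covering a whole batch of tables,
    # so each batch costs a single roundtrip instead of one per table.
    parts = ["SELECT DISTINCT '%s' AS table_name FROM `%s` WHERE `def` = '%s'"
             % (t, t, search_str) for t in table_names]
    return "\n UNION ALL\n".join(parts)

def find_tables(run_query, search_str, batch_size=200):
    # Only query tables that actually have a `def` column.
    tables = [row[0] for row in run_query(
        "SELECT table_name FROM information_schema.columns"
        " WHERE table_schema = DATABASE() AND column_name = 'def'"
        " GROUP BY table_name ORDER BY table_name")]
    matches = []
    for i in range(0, len(tables), batch_size):
        batch = tables[i:i + batch_size]
        matches.extend(row[0] for row in run_query(build_batch_query(batch, search_str)))
    return matches
```

With 30,000 tables and a batch size of 200, that is 150 roundtrips instead of 30,000, which should make a substantial difference on its own.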

Upvotes: 3
