Fast query for selecting all records NOT in another table in MySQL

Question

I have a query that selects a table of nodes, then joins a table of titles to it. This is done by first joining an in-between table of node IDs and title IDs that allows a many-to-many relationship between the first two tables. Both joins are inner so that only nodes with a properly configured and existing title are selected. I believe this to all be clean and efficient - the problem is what follows:

There is also a fourth table that provides a simple hierarchy for nodes; node_parents. Each row has two fields; a node ID and a node ID that acts as that node's parent (node_id and parent_id). Some nodes do not have children configured in this database (ie. the node itself isn't marked as a parent in any row of the node_parents table) - these are the nodes I'm trying to select.

The additional criteria for these childless nodes is that they have a specific title configured - hence the subquery initially selecting from node_titles and then inner joining node_parents. The subquery also has a GROUP BY because some nodes are parents of multiple nodes, so their node_id will unnecessarily appear multiple times in the results. I should also point out that because of this the primary key for node_parents is a combination of the node_id and parent_id.

The query:

SELECT  `nodes`.`node_id`,
        `titles`.`title`
FROM `nodes`
INNER JOIN `node_titles`
ON `nodes`.`node_id` = `node_titles`.`node_id`
INNER JOIN `titles`
ON `node_titles`.`title_id` = `titles`.`title_id`
WHERE `nodes`.`node_id` NOT IN
    (
    SELECT `node_titles`.`node_id`
    FROM `node_titles`
    INNER JOIN `node_parents`
    ON `node_titles`.`node_id` = `node_parents`.`parent_id`
    WHERE `node_titles`.`title_id` = 1
    GROUP BY `node_titles`.`node_id`
    )
AND `titles`.`title_id` = 1

Tables sizes: nodes = ~32,000 node_titles = ~49,000 titles = 3 node_parents = ~55,000

The query takes around 16 minutes to complete. Can anyone provide any pointers? I have tried profiling the query - which doesn't have any long hangs, but it does repeat this cycle for what seems like every selected row:

| executing                      | 0.000005 |
| Copying to tmp table           | 0.515815 |
| Sorting result                 | 0.000053 |
| Sending data                   | 0.000028 |

I have also tried ditching the subquery and using a LEFT JOIN with a WHERE foo IS NOT NULL, but this still takes a long time to process - the profiler claims ~180 seconds for 'Copying to tmp table'.

Ultimately I suspect this might be an indexing problem - but either way I'd appreciate answers that aren't questioning the implementation of the query unless they are pursuing a possible cause of the slowdown (eg. yes, the titles and nodes must be in a many-to-many relationship). Thanks all, and further information on request!

Quassnoi · Accepted Answer

Remove the GROUP BY from the subquery:

SELECT  nodes.node_id,
        titles.title
FROM    nodes n
INNER JOIN
        node_titles nt
ON      nt.node_id = n.node_id
INNER JOIN
        titles t
ON      t.title_id = nt.title_id
WHERE   n.node_id NOT IN
        (
        SELECT  nti.node_id
        FROM    node_titles nti
        INNER JOIN 
                node_parents npi
        ON      npi.parent_id = nt.node_id
        WHERE   nti.title_id = 1
        )

Create the following indexes:

node_titles (node_id, title_id)
titles (title_id)
node_parents (parent_id)

Update:

Try this:

SELECT  nodes.node_id,
        titles.title
FROM    nodes n
INNER JOIN
        node_titles nt
ON      nt.node_id = n.node_id
        AND nt.title_id = 1
INNER JOIN
        titles t
ON      t.title_id = nt.title_id
WHERE   n.node_id NOT IN
        (
        SELECT  parent_id
        FROM    node_parents
        )

Fast query for selecting all records NOT in another table in MySQL

Answers (2)

Related Questions