Reputation: 3517
I have a query that selects a table of nodes, then joins a table of titles to it. This is done by first joining an in-between table of node IDs and title IDs that allows a many-to-many relationship between the first two tables. Both joins are inner so that only nodes with a properly configured and existing title are selected. I believe this to all be clean and efficient - the problem is what follows:
There is also a fourth table that provides a simple hierarchy for nodes; node_parents. Each row has two fields; a node ID and a node ID that acts as that node's parent (node_id and parent_id). Some nodes do not have children configured in this database (ie. the node itself isn't marked as a parent in any row of the node_parents table) - these are the nodes I'm trying to select.
The additional criteria for these childless nodes is that they have a specific title configured - hence the subquery initially selecting from node_titles and then inner joining node_parents. The subquery also has a GROUP BY because some nodes are parents of multiple nodes, so their node_id will unnecessarily appear multiple times in the results. I should also point out that because of this the primary key for node_parents is a combination of the node_id and parent_id.
The query:
SELECT `nodes`.`node_id`,
`titles`.`title`
FROM `nodes`
INNER JOIN `node_titles`
ON `nodes`.`node_id` = `node_titles`.`node_id`
INNER JOIN `titles`
ON `node_titles`.`title_id` = `titles`.`title_id`
WHERE `nodes`.`node_id` NOT IN
(
SELECT `node_titles`.`node_id`
FROM `node_titles`
INNER JOIN `node_parents`
ON `node_titles`.`node_id` = `node_parents`.`parent_id`
WHERE `node_titles`.`title_id` = 1
GROUP BY `node_titles`.`node_id`
)
AND `titles`.`title_id` = 1
Tables sizes: nodes = ~32,000 node_titles = ~49,000 titles = 3 node_parents = ~55,000
The query takes around 16 minutes to complete. Can anyone provide any pointers? I have tried profiling the query - which doesn't have any long hangs, but it does repeat this cycle for what seems like every selected row:
| executing | 0.000005 |
| Copying to tmp table | 0.515815 |
| Sorting result | 0.000053 |
| Sending data | 0.000028 |
I have also tried ditching the subquery and using a LEFT JOIN with a WHERE foo IS NOT NULL, but this still takes a long time to process - the profiler claims ~180 seconds for 'Copying to tmp table'.
Ultimately I suspect this might be an indexing problem - but either way I'd appreciate answers that aren't questioning the implementation of the query unless they are pursuing a possible cause of the slowdown (eg. yes, the titles and nodes must be in a many-to-many relationship). Thanks all, and further information on request!
Upvotes: 2
Views: 2020
Reputation: 425471
Remove the GROUP BY
from the subquery:
SELECT nodes.node_id,
titles.title
FROM nodes n
INNER JOIN
node_titles nt
ON nt.node_id = n.node_id
INNER JOIN
titles t
ON t.title_id = nt.title_id
WHERE n.node_id NOT IN
(
SELECT nti.node_id
FROM node_titles nti
INNER JOIN
node_parents npi
ON npi.parent_id = nt.node_id
WHERE nti.title_id = 1
)
Create the following indexes:
node_titles (node_id, title_id)
titles (title_id)
node_parents (parent_id)
Update:
Try this:
SELECT nodes.node_id,
titles.title
FROM nodes n
INNER JOIN
node_titles nt
ON nt.node_id = n.node_id
AND nt.title_id = 1
INNER JOIN
titles t
ON t.title_id = nt.title_id
WHERE n.node_id NOT IN
(
SELECT parent_id
FROM node_parents
)
Upvotes: 2
Reputation: 8291
MySql tends to have issues with subqueries in my experience. Try this
SELECT nodes.node_id,
titles.title
FROM nodes b
INNER JOIN
node_titles nt
ON nt.node_id = n.node_id
INNER JOIN
titles t
ON t.title_id = nt.title_id
LEFT OUTER JOIN
(
SELECT nti.node_id
FROM node_titles nti
INNER JOIN
node_parents npi
ON npi.parent_id = nt.node_id
WHERE nti.title_id = 1
) ThisTable on n.node_id = ThisTable.node_id
WHERE ThisTable.node_id is null
Upvotes: 1