Reputation: 2305
Overview:
I have three tables 1) subscribers, bios, and shirtsizes and i need to find the subscribers without a bio or shirtsizes
the tables are laid out such as
subscribers
| season_id | user_id |
bio
| bio_id | user_id |
shirt sizes
| bio_id | shirtsize |
And I need to find all users who do not have a bio or shirtsize, (if no bio; then no shirtsize via relation) for any given season.
I originally wrote a query like:
SELECT *
FROM subscribers s
LEFT JOIN bio b ON b.user_id = subscribers.user_id
LEFT JOIN shirtsizes ON shirtsize.bio_id = bio.bio_id
WHERE s.season_id = 185181 AND (bio.bio_id IS NULL OR shirtsize.size IS NULL);
but it is taking 10 seconds to complete now.
I am wondering how I can restructure the query (or possibly the problem) so that it will preform reasonably.
Here is the mysql explain: (ogu = subscribers, b = bio, tn = shirtshize)
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+---------+---------+-------------+--------+-------------+
| 1 | SIMPLE | ogu | ref | PRIMARY | PRIMARY | 4 | const | 133 | Using where |
| 1 | SIMPLE | b | index | NULL | PRIMARY | 8 | NULL | 187644 | Using index |
| 1 | SIMPLE | tn | ref | nid | nid | 4 | waka2.b.nid | 1 | Using where |
The above is pretty sanitized, here's the realz info:
mysql> DESCRIBE subscribers
+-----------+---------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-----------+---------+------+-----+---------+-------+
| subscribers | int(11) | NO | PRI | | |
| uid | int(11) | NO | PRI | | |
mysql> DESCRIBE bio;
+-------+------------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------+------------------+------+-----+---------+-------+
| bio_id | int(10) unsigned | NO | PRI | 0 | |
| uid | int(10) unsigned | NO | PRI | 0 | |
mysql> DESCRIBE shirtsize;
+-------+------------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------+------------------+------+-----+---------+-------+
| bio_id | int(10) unsigned | NO | PRI | 0 | |
| shirtsize | int(10) unsigned | NO | PRI | 0 | |
and the real query looks like:
SELECT ogu.nid, ogu.is_active, ogu.uid, b.nid AS bio_node, tn.nid AS size
FROM og_uid ogu
LEFT JOIN bio b ON b.uid = ogu.uid
LEFT JOIN term_node tn ON tn.nid = b.nid
WHERE ogu.nid = 185033 AND ogu.is_admin = 0
AND (b.nid IS NULL OR tn.tid IS NULL)
nid is season_id or bio_id (with a type); term_node is going to be the shirtsize
Upvotes: 10
Views: 53659
Reputation: 425261
Your query, as it is written now, evaluates all bio
's and term_node
's if they exist, and then filters them out.
But what you want is just find og_uid
's that don't have term_node
's (not having a bio
also implies not having a term_node
)
So you just want to stop evaluating bio
's and term_node
's as soon as you find the first existing term_node
:
SELECT *
FROM (
SELECT ogu.nid, ogu.is_active, ogu.uid,
(
SELECT 1
FROM bio b, term_node tn
WHERE b.uid = ogu.uid
AND tn.nid = b.nid
LIMIT 1
) AS ex
FROM og_uid ogu
WHERE ogu.nid = 185033
AND ogu.is_admin = 0
) ogu1
WHERE ex IS NULL
This will evaluate at most one bio
and at most one term_node
for each og_uid
, instead of evaluating all existing thousands and the filtering them out.
Should work much faster.
Upvotes: 1
Reputation:
I presume that your "big table" is subscribers, and that season_id is probably neither selective nor indexed (indexing it is rather meaningless if it's not selective, anyway), which means that you'll have to fully scan subscribers, anyway. Parting, I would join (with an inner join) the two other tables - note that if there is no bio_id in shirt_size it's exactly the same for your query as if there were no bio. First bit:
select uid
from bio
inner join shirtsizes
on shirtsizes.bio_id = bio.bio_id
At which point you want to check that shirtsizes is indexed on bio_id. Now you can left outer join this query to subscribers:
select *
from subscribers s
left outer join (select uid
from bio
inner join shirtsizes
on shirtsizes.bio_id = bio.bio_id) x
on x.uid = s.uid
where s.season_id = 185181
and x.uid is null
which is likely to run reasonably fast if neither bio nor shirtsizes are gigantic ...
Upvotes: 0
Reputation: 10344
If you define what you are looking for exactly rather than SELECT * it might speed it up a bit... also OR is not the fastest query to be doing, if you can re-write it without the OR it will be faster.
Also... you could try unions instead of left joins maybe?
SELECT s.user_id FROM subscribers s LEFT JOIN bio b ON b.user_id = s.user_id LEFT JOIN shirtsizes ON shirtsize.bio_id = bio.bio_id WHERE s.season_id = 185181 AND (bio.bio_id IS NULL OR shirtsize.size IS NULL);
would be something like:
(SELECT s.user_id FROM subscribers s WHERE s.season_id = 185181) UNION (SELECT b.user_id, b.bio_id FROM bio b WHERE bio.bio_id IS NULL) UNION (SELECT shirtsizes.bio_id FROM shirtsizes WHERE shirtsizes.size is NULL)
(to be honest that doesn't look right to me... but then I never use joins or join syntax or unions...)
I would do:
SELECT * FROM subscribers s, bio b, shirtsizes sh WHERE s.season_id = 185181 AND shirtsize.bio_id = bio.bio_id AND b.user_id = s.user_id AND (bio.bio_id IS NULL OR shirtsize.size IS NULL);
Upvotes: 1
Reputation: 532435
Obviously I haven't checked this but it seems to be that what you want is to select any subscriber where there there isn't a matching bio or the join between bios and shirtsizes fails. I would consider using NOT EXISTS for this condition. You'll probably want indexes on bio.user_id and shirtsizes.bio_id.
select *
from subscribers
where s.season_id = 185181
and not exists (select *
from bio join shirtsizes on bio.bio_id = shirtsizes.bio_id
where bio.user_id = subscribers.user_id)
EDIT:
Based on your update, you may want to create separate keys on each column instead of/in addition to having compound primary keys. It's possible that the joins aren't able to take optimal advantage of the compound primary indexes and an index on the join columns themselves may speed things up.
Upvotes: 4
Reputation: 16896
select * from subscribers where user_id not in (
select user_id from bio where bio_id not in (
select bio_id from shirt_sizes
)
) and season_id=185181
Upvotes: 0
Reputation: 753515
Would it be any quicker to do a difference between the list of subscribers for the relevant season and the list of subscribers for the season with bios and shirt sizes?
SELECT *
FROM Subscribers
WHERE season_id = 185181
AND user_id NOT IN
(SELECT DISTINCT s.user_id
FROM subscribers s
JOIN bios b ON s.user_id = b.user_id
JOIN shirtsizes z ON b.bio_id = z.bio_id
WHERE s.season_id = 185181
)
This avoids outer joins, which are not as fast as inner joins, and may therefore be quicker. On the other hand, it might be creating two large lists with very few differences between them. It is not clear whether the DISTINCT in the sub-query would improve or harm performance. It implies a sort operation (expensive) but paves the way for a merge-join if the MySQL optimizer supports such things.
There might be other notations available - MINUS or DIFFERENCE, for example.
Upvotes: 1
Reputation: 161773
Is bio_id
the primary key of bios? Is it really possible for there to be a bios row with b.user_id
= subscribers.user_id
but with b.bio_id
NULL?
Are there shirtsize rows with shirtsize.bio_id
NULL? Do those rows ever have shirtsize.size not NULL?
Upvotes: 1
Reputation: 2221
Joins are one of the most expensive operations that you can perform on an SQL query. While it should be able to automatically optimize your query somewhat, maybe try restructuring it. First of all, I would instead of SELECT *, be sure to specify which columns you need from which relations. This will speed things up quite a bit.
If you only need the user ID for example:
SELECT s.user_id
FROM subscribers s
LEFT JOIN bio b ON b.user_id = subscribers.user_id
LEFT JOIN shirtsizes ON shirtsize.bio_id = bio.bio_id
WHERE s.season_id = 185181 AND (bio.bio_id IS NULL OR shirtsize.size IS NULL);
That will allow the SQL database to restructure your query a little more efficiently on its own.
Upvotes: 15
Reputation: 19627
The query should be OK. I would run it through a query analyzer and refine the indexes on the tables.
Upvotes: 17