Reputation: 13522
I'm running two queries.
The first one gets unique IDs. This executes in ~350ms.
select parent_id
from duns_match_sealed_air_072815
group by duns_number
Then I paste those IDs into this second query. With >10k ids pasted in, it also executes in about ~350ms.
select term, count(*) as count
from companies, business_types, business_types_to_companies
where
business_types.id = business_types_to_companies.term_id
and companies.id = business_types_to_companies.company_id
and raw_score > 25
and diversity = 1
and company_id in (paste,ten,thousand,ids,here)
group by term
order by count desc;
When I combine these queries into one it takes a long time to execute. I don't know how long because I stopped it after minutes.
select term, count(*) as count
from companies, business_types, business_types_to_companies
where
business_types.id = business_types_to_companies.term_id
and companies.id = business_types_to_companies.company_id
and raw_score > 25
and diversity = 1
and company_id in (
select parent_id
from duns_match_sealed_air_072815
group by duns_number
)
group by term
order by count desc;
What is going on?
Upvotes: 0
Views: 58
Reputation: 108651
First, with respect, your subquery doesn't use GROUP BY
in a sensible way.
select parent_id /* wrong GROUP BY */
from duns_match_sealed_air_072815
group by duns_number
In fact, it misuses the pernicious MySQL extension to GROUP BY
. Read this. http://dev.mysql.com/doc/refman/5.6/en/group-by-handling.html . I can't tell what your application logic intends from this query, but I can tell you that it actually returns an unpredictably selected parent_id
value associated with each distinct duns_number
value.
Do you want
select MIN(parent_id) parent_id
from duns_match_sealed_air_072815
group by duns_number
or something like that? That one selects the lowest parent ID associated with each given number.
Sometimes MySQL has a hard time optimizing the WHERE .... IN ()
query pattern. Try a join instead. Like this:
select term, count(*) as count
from companies
join (
select MIN(parent_id) parent_id
from duns_match_sealed_air_072815
group by duns_number
) idlist ON companies.id = idlist.parent_id
join business_types_to_companies ON companies.id = business_types_to_companies.company_id
join business_types ON business_types.id = business_types_to_companies.term_id
where raw_score > 25
and diversity = 1
group by term
order by count desc
To optimize this further we'll need to see the table definitions and the output from EXPLAIN.
Upvotes: 1
Reputation: 37059
The query has been re-written using JOIN
, but particularly I've used EXISTS
instead of IN
. This is a short in the dark. It is possible that there may be many values generated in the sub-query causing the outer query to struggle while it goes through matching each item returned from the sub-query.
select term, count(*) as count
from companies c
inner join business_types_to_companies bc on bc.company_id = c.id
inner join business_types b on b.id = bc.term_id
where
raw_score > 25
and diversity = 1
and exists (
select 1
from duns_match_sealed_air_072815
where parent_id = c.id
)
group by term
order by count desc;
Upvotes: 1
Reputation: 1541
It's down to the way it processes the query - I believe it has to run your embedded query once for each row, whereas using two queries allows you to store the result.
Hope this helps!
Upvotes: 1