Reputation: 1278
I have the following SQL:
SELECT j.AssocJobKey
, COUNT(DISTINCT o.ID) AS SubjectsOrdered
, COUNT(DISTINCT s.ID) AS SubjectsShot
FROM Jobs j
LEFT JOIN Orders o ON o.AssocJobKey = j.AssocJobKey
LEFT JOIN Subjects s ON j.AssocJobKey = s.AssocJobKey
GROUP BY
j.AssocJobKey
,j.JobYear
The basic structure is a Job is the parent that is unique by AssocJobKey and has a one to many relationships with Subjects and Orders. The query gives me what I want, the output looks like this:
| AssocJobKey | SubjectsOrdered | SubjectsShot |
|-----------------------|------------------------|---------------------|
| BAT-H181 | 107 | 830 |
|--------------------- |------------------------|---------------------|
| BAT-H131 | 226 | 1287 |
The problem is the query is way to heavy and my memory is spiking, there's no way I could run this on a large dataset. If I remove one of the LEFT JOINs on the corresponding count the query executes instantly and theres no problem. So somehow things are bouncing around between the two left joins more than they should, but I don't understand why they would.
Really hoping to avoid joining on sub selects if at all possible.
Upvotes: 1
Views: 103
Reputation: 1270191
Your query is generating a Cartesian product for each job. And this is big -- your second row has about 500k rows being generated. COUNT(DISTINCT)
then has to figure out the unique ids among this Cartesian product.
The solution is simple: pre-aggregate:
SELECT j.AssocJobKey, o.SubjectsOrdered, s.SubjectsShot
FROM Jobs j LEFT JOIN
(SELECT o.AssocJobKey, COUNT(*) as SubjectsOrdered
FROM Orders o
GROUP BY o.AssocJobKey
) o
ON o.AssocJobKey = j.AssocJobKey LEFT JOIN
(SELECT j.AssocJobKey, COUNT(s.ID) AS SubjectsShot
FROM Subjects s
GROUP BY j.AssocJobKey
) s
ON j.AssocJobKey = s.AssocJobKey;
This makes certain assumptions that I think are reasonable:
id
s in the orders and subjects table are unique and non-NULL.jobs.AssocJobKey
is unique.The query can be easily adapted if either of these are not true, but they seem like reasonable assumptions.
Often for these types of joins over different dimensions, COUNT(DISTINCT)
is a reasonable solution (the queries are certainly simpler). This is true when there are at most a handful of values.
Upvotes: 1