Reputation: 3
I have an query which gives me the wrong result.
Tables:
A
+----+
| id |
+----+
| 1 |
| 2 |
+----+
B
+----+----+
| id | x | B.id = A.id
+----+----+
| 1 | 1 |
| 1 | 1 |
| 1 | 0 |
+----+----+
C
+----+----+
| id | y | C.id = A.id
+----+----+
| 1 | 1 |
| 1 | 2 |
+----+----+
What I want to do: Select all rows from A. For each row in A count in B all x with value 1 and all x with value 0 with B.id = A.id. For each row in A get the minimum y from C with C.id = A.id.
The result I am expecting is:
+----+------+--------+---------+
| id | min | count1 | count 2 |
+----+------+--------+---------+
| 1 | 1 | 2 | 1 |
| 2 | NULL | 0 | 0 |
+----+------+--------+---------+
First Try: This doesn't work.
SELECT a.id,
MIN(c.y),
SUM(IF(b.x = 1, 1, 0)),
SUM(IF(b.x = 0, 1, 0))
FROM a
LEFT JOIN b
ON ( a.id = b.id )
LEFT JOIN c
ON ( a.id = c.id )
GROUP BY a.id
+----+------+--------+---------+
| id | min | count1 | count 2 |
+----+------+--------+---------+
| 1 | 1 | 4 | 2 |
| 2 | NULL | 0 | 0 |
+----+------+--------+---------+
Second Try: This works but I am sure it has a bad performance.
SELECT a.id,
MIN(c.y),
b.x,
b.y
FROM a
LEFT JOIN (SELECT b.id, SUM(IF(b.x = 1, 1, 0)) x, SUM(IF(b.x = 0, 1, 0)) y FROM b) b
ON ( a.id = b.id )
LEFT JOIN c
ON ( a.id = c.id )
GROUP BY a.id
+----+------+--------+---------+
| id | min | count1 | count 2 |
+----+------+--------+---------+
| 1 | 1 | 2 | 1 |
| 2 | NULL | 0 | 0 |
+----+------+--------+---------+
Last Try: This works too.
SELECT x.*,
SUM(IF(b.x = 1, 1, 0)),
SUM(IF(b.x = 0, 1, 0))
FROM (SELECT a.id,
MIN(c.y)
FROM a
LEFT JOIN c
ON ( a.id = c.id )
GROUP BY a.id) x
LEFT JOIN b
ON ( b.id = x.id )
GROUP BY x.id
Now my question is: Is the last one the best choise or is there a way to write this query with just one select statement (like in the first try)?
Upvotes: 0
Views: 128
Reputation: 1270391
Your joins are doing cartesian products for a given value, because there are multiple rows in each table.
You can fix this by using count(distinct)
rather than sum()
:
SELECT a.id, MIN(c.y),
count(distinct (case when b.x = 1 then b.id end)),
count(distinct (case when b.x = 0 then b.id end))
FROM a
LEFT JOIN b
ON ( a.id = b.id )
LEFT JOIN c
ON ( a.id = c.id )
GROUP BY a.id;
You can also fix this by pre-aggregating b
(and/or c
). And you would need to take that approach if your aggregation function were something like the sum of a column in b
.
EDIT:
You are correct. The above query counts the distinct values of B
, but B
contains rows that are exact duplicates. (Personally, I think having a column with the name id
that has duplicates is a sign of poor design, but that is another issue.)
You could solve it by having a real id
in the b
table, because then the count(distinct)
would count the correct values. You can also solve it by aggregating the two tables before joining them in:
SELECT a.id, c.y, x1, x0
FROM a
LEFT JOIN (select b.id,
sum(b.x = 1) as x1,
sum(b.x = 0) as x0
from b
group by b.id
) b
ON ( a.id = b.id )
LEFT JOIN (select c.id, min(c.y) as y
from c
group by c.id
) c
ON ( a.id = c.id );
Here is a SQL Fiddle for the problem.
EDIT II:
You can get it in one statement, but I'm not so sure that it would work on similar data. The idea is that you can count all the cases where x = 1
and then divide by the number of rows in the C
table to get the real distinct count:
SELECT a.id, MIN(c.y),
coalesce(sum(b.x = 1), 0) / count(distinct coalesce(c.y, -1)),
coalesce(sum(b.x = 0), 0) / count(distinct coalesce(c.y, -1))
FROM a
LEFT JOIN b
ON ( a.id = b.id )
LEFT JOIN c
ON ( a.id = c.id )
GROUP BY a.id;
It is a little tricky, because you have to handle NULL
s to get the right values. Note that this is counting the y
value to get a distinct count from the C
table. Your question re-enforces why it is a good idea to have a unique integer primary key in every table.
Upvotes: 3