Reputation: 1
I’m querying from a table that has repeated uuids, and I want to remove duplicates. I also want to exclude some irrelevant data which requires joining on another table. I can remove duplicates and then exclude irrelevant data, or I can switch the order and exclude then remove duplicates. Intuitively, I feel like if anything, removing duplicates then joining should produce more rows than joining and then removing duplicates, but that is the opposite of what I’m seeing. What am I missing here?
In this one, I remove duplicates in the first subquery and filter in the second, and I get 500k rows:
with tbl1 as (
select distinct on (uuid) uuid, foreign_key
from original_data
where date > some_date
),
tbl2 as (
select uuid
from tbl1
left join other_data
on tbl1.foreign_key = other_data.id
where other_data.category <> something
)
select * from tbl2
If I filter then remove duplicates, I get 550k rows:
with tbl1 as (
select uuid, foreign_key
from original_data
where date > some_date
),
tbl2 as (
select uuid
from tbl1
left join other_data
on tbl1.foreign_key = other_data.id
where other_data.category <> something
),
tbl3 as (
select distinct on (uuid) uuid
from tbl2
)
select * from tbl3
Is there an explanation here?
Upvotes: 0
Views: 144
Reputation: 26163
Does original_data.foreign_key have a foreign key constraint referencing other_data.id, or can there be foreign_key values that don't link to any id in other_data? And is the other_data.category or original_data.foreign_key column missing a NOT NULL constraint?
In either of these cases Postgres would filter out, in both of your approaches, all records with
1) foreign_key = null,
2) a foreign_key that doesn't match any id in other_data,
3) an other_data record with a category set to null,
regardless of whether they're duplicates or not, because other_data.category <> something evaluates to null for them, which does not satisfy the WHERE clause. That, combined with the missing ORDER BY causing DISTINCT ON to drop different duplicates arbitrarily each time, could result in dropping the duplicates that then get filtered out in tbl2 in the first approach, but not in the second.
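To see the three-valued logic at work, here is a minimal sketch using SQLite from Python; its comparison semantics match Postgres on this point (the filter value 'something' is just a placeholder):

```python
import sqlite3

con = sqlite3.connect(':memory:')
# NULL on either side of <> yields NULL (shown as None in Python),
# and a WHERE clause treats NULL as "not satisfied", same as false.
row = con.execute("select null <> 'something', 'a' <> 'something'").fetchone()
print(row)  # (None, 1)
```

So a row whose left join produced no match, or whose matched category is null, never survives the WHERE clause.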
Example:
pgsql122=# select * from original_data;
uuid | foreign_key | comment
------+-------------+---------------------------------------------------
1 | 1 | correct, non-duplicate record with a correct link
3 | 2 | duplicate record with a broken link
3 | 1 | duplicate record with a correct link
4 | null | duplicate record with a missing link
4 | 1 | duplicate record with a correct link
5 | 3 | duplicate record with a correct link, but a null category behind it
5 | 1 | duplicate record with a correct link
6 | null | correct, non-duplicate record with a missing link
7 | 2 | correct, non-duplicate record with a broken link
8 | 3 | correct, non-duplicate record with a correct link, but a null category behind it
pgsql122=# select * from other_data;
id | category
----+----------
1 | a
3 | null
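To illustrate with the example rows above, here is a small Python simulation of both pipelines. It assumes a hypothetical filter value something = 'b', and models the arbitrary DISTINCT ON pick (no ORDER BY) as "first row seen per uuid", which happens to be the worst case for this data:

```python
# Example data from the tables above: (uuid, foreign_key) rows and id -> category.
original_data = [
    (1, 1), (3, 2), (3, 1), (4, None), (4, 1),
    (5, 3), (5, 1), (6, None), (7, 2), (8, 3),
]
other_data = {1: 'a', 3: None}

def passes_filter(fk, something='b'):
    # LEFT JOIN then WHERE other_data.category <> something:
    # a missing match or a null category makes the comparison
    # evaluate to null, which fails the WHERE clause.
    category = other_data.get(fk)  # None when fk is null or unmatched
    return category is not None and category != something

# Approach 1: dedupe first (keep the first row per uuid), then filter.
deduped = {}
for uuid, fk in original_data:
    deduped.setdefault(uuid, fk)
approach1 = [u for u, fk in deduped.items() if passes_filter(fk)]

# Approach 2: filter first, then dedupe.
approach2 = sorted({u for u, fk in original_data if passes_filter(fk)})

print(approach1)  # [1] -- uuids 3, 4, 5 kept their "bad" duplicate
print(approach2)  # [1, 3, 4, 5]
```

Dedupe-first keeps only uuid 1, because the arbitrarily kept duplicate for uuids 3, 4 and 5 is the one the join filter then removes; filter-first keeps all four.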
As @a_horse_with_no_name suggested, adding an ORDER BY should make DISTINCT ON consistent and predictable, but only as long as records actually vary on the columns used for ordering. It also won't help if you have other issues, like the ones I described above.
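As a sketch of what the ORDER BY buys you: sorting before the per-uuid pick (the Python analogue of select distinct on (uuid) ... order by uuid, foreign_key) makes the kept row deterministic instead of scan-order-dependent:

```python
# Duplicate uuids with differing foreign keys.
rows = [(3, 2), (3, 1), (4, None), (4, 1)]

# Sort by (uuid, foreign_key) with nulls last, mirroring the Postgres
# default of NULLS LAST for ascending order.
rows.sort(key=lambda r: (r[0], r[1] is None, r[1] or 0))

# Keep the first row per uuid -- now always the same one.
deduped = {}
for uuid, fk in rows:
    deduped.setdefault(uuid, fk)
print(deduped)  # {3: 1, 4: 1}
```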
Upvotes: 1