SMc

Reputation: 1

Why does using DISTINCT ON (...) at different points in a query return different (unintuitive) results?

I’m querying a table that has repeated uuids, and I want to remove duplicates. I also want to exclude some irrelevant data, which requires joining on another table. I can remove duplicates and then exclude the irrelevant data, or I can switch the order and exclude first, then remove duplicates. Intuitively, I’d expect that, if anything, removing duplicates and then joining should produce more rows than joining and then removing duplicates, but I’m seeing the opposite. What am I missing here?

In this one, I remove duplicates in the first subquery and filter in the second, and I get 500k rows:

with tbl1 as (
select distinct on (uuid) uuid, foreign_key
from original_data
where date > some_date
),

tbl2 as (
select uuid
from tbl1
left join other_data
on tbl1.foreign_key = other_data.id
where other_data.category <> something
)

select * from tbl2 

If I filter then remove duplicates, I get 550k rows:

with tbl1 as (
select uuid, foreign_key
from original_data
where date > some_date
),

tbl2 as (
select uuid
from tbl1
left join other_data
on tbl1.foreign_key = other_data.id
where other_data.category <> something
),

tbl3 as (
select distinct on (uuid) uuid
from tbl2
)

select * from tbl3 

Is there an explanation here?

Upvotes: 0

Views: 144

Answers (1)

Zegarek

Reputation: 26163

Does original_data.foreign_key lack a foreign key constraint referencing other_data.id, allowing foreign_key values that don't link to any id in other_data?

Is the other_data.category or original_data.foreign_key column missing a NOT NULL constraint?
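If you're not sure, quick checks along these lines will tell you whether such rows exist (a sketch using the table and column names from your query):

select count(*) as broken_or_missing_links
from original_data
left join other_data
on original_data.foreign_key = other_data.id
where other_data.id is null;

select count(*) as null_categories
from other_data
where category is null;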

In either of these cases Postgres would filter out all records with

  1. a missing link (foreign_key is null)
  2. a broken link (foreign_key doesn't match any id in other_data)
  3. a link to an other_data record whose category is null

in both of your approaches, regardless of whether they're duplicates or not, because other_data.category <> something evaluates to null rather than true for all three, and null does not satisfy the WHERE clause. That, combined with the missing ORDER BY, which lets DISTINCT ON keep an arbitrary row out of each set of duplicates, means your first approach can keep exactly the duplicates that then get filtered out in tbl2, while your second approach filters those rows out before deduplicating. Example:

pgsql122=# select * from original_data;
 uuid | foreign_key |                      comment
------+-------------+---------------------------------------------------
    1 |           1 | correct, non-duplicate record with a correct link
    3 |           2 | duplicate record with a broken link
    3 |           1 | duplicate record with a correct link
    4 |        null | duplicate record with a missing link
    4 |           1 | duplicate record with a correct link
    5 |           3 | duplicate record with a correct link, but a null category behind it
    5 |           1 | duplicate record with a correct link
    6 |        null | correct, non-duplicate record with a missing link
    7 |           2 | correct, non-duplicate record with a broken link
    8 |           3 | correct, non-duplicate record with a correct link, but a null category behind it

pgsql122=# select * from other_data;
 id | category
----+----------
  1 | a
  3 | null

  1. Both of your approaches keep uuid 1 and eliminate uuids 6, 7 and 8 even though they're unique.
  2. Your first approach arbitrarily keeps between 0 and 3 of the 3 duplicated uuids (3, 4 and 5), depending on which row in each pair gets discarded by DISTINCT ON.
  3. Your second approach always keeps one record for each of uuids 3, 4 and 5: every duplicate with a missing link, a broken link, or a link to a null category is already gone by the time you discard duplicates. (A setup script reproducing these tables follows below.)
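For reference, a minimal setup reproducing the tables above (column types are assumptions, the comment column is left out, and plain integers stand in for real uuids):

create table other_data (id int, category text);
insert into other_data values (1, 'a'), (3, null);

create table original_data (uuid int, foreign_key int);
insert into original_data values
(1, 1), (3, 2), (3, 1), (4, null), (4, 1),
(5, 3), (5, 1), (6, null), (7, 2), (8, 3);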

As @a_horse_with_no_name suggested, an ORDER BY will make DISTINCT ON consistent and predictable, but only as long as the records vary in the columns used for ordering. It also won't help if you have other issues, like the ones described above.
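Putting both together, a sketch of your first approach made deterministic and null-tolerant (ordering by date and keeping unmatched rows are assumptions about your intent; something is still a placeholder):

with tbl1 as (
select distinct on (uuid) uuid, foreign_key
from original_data
where date > some_date
order by uuid, date desc -- keep the newest row per uuid instead of an arbitrary one
)

select uuid
from tbl1
left join other_data
on tbl1.foreign_key = other_data.id
where other_data.category is distinct from something -- null-safe: also keeps missing links, broken links and null categories

IS DISTINCT FROM treats null like an ordinary value, so the rows where <> would evaluate to null survive the filter; if you actually do want those rows gone, your original <> already does that.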

Upvotes: 1
