SMc

Reputation: 1

Why does using DISTINCT ON (...) at different points in a query return different (unintuitive) results?

I’m querying a table that has repeated uuids, and I want to remove duplicates. I also want to exclude some irrelevant data, which requires joining on another table. I can remove duplicates and then exclude the irrelevant data, or I can switch the order and exclude first, then remove duplicates. Intuitively, I’d expect that, if anything, removing duplicates and then joining should produce more rows than joining and then removing duplicates, but I’m seeing the opposite. What am I missing here?

In this one, I remove duplicates in the first subquery and filter in the second, and I get 500k rows:

with tbl1 as (
select distinct on (uuid) uuid, foreign_key
from original_data
where date > some_date
),

tbl2 as (
select uuid
from tbl1
left join other_data
on tbl1.foreign_key = other_data.id
where other_data.category <> something
)

select * from tbl2 

If I filter then remove duplicates, I get 550k rows:

with tbl1 as (
select uuid, foreign_key
from original_data
where date > some_date
),

tbl2 as (
select uuid
from tbl1
left join other_data
on tbl1.foreign_key = other_data.id
where other_data.category <> something
),

tbl3 as (
select distinct on (uuid) uuid
from tbl2
)

select * from tbl3 

Is there an explanation here?

Upvotes: 0

Views: 144

Answers (1)

Zegarek

Reputation: 26163

Does original_data.foreign_key lack a foreign key constraint referencing other_data.id, allowing foreign_key values that don't link to any id in other_data?

Is the other_data.category or original_data.foreign_key column missing a NOT NULL constraint?
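If you're not sure, quick checks along these lines will tell you whether such rows exist (a sketch using the table and column names from your query):

select count(*) as broken_or_missing_links
from original_data
left join other_data
on original_data.foreign_key = other_data.id
where other_data.id is null;

select count(*) as null_categories
from other_data
where category is null;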

In either of these cases Postgres would filter out all records with

  1. a missing link (foreign_key is null)
  2. a broken link (foreign_key doesn't match any id in other_data)
  3. a link to an other_data record whose category is null

in both of your approaches, regardless of whether they're duplicates or not, because other_data.category <> something evaluates to null rather than true for all three, and null does not satisfy the WHERE clause. That, combined with the missing ORDER BY, which lets DISTINCT ON keep an arbitrary row out of each set of duplicates, means your first approach can keep exactly the duplicates that then get filtered out in tbl2, while your second approach filters those rows out before deduplicating. Example:

pgsql122=# select * from original_data;
 uuid | foreign_key |                      comment
------+-------------+---------------------------------------------------
    1 |           1 | correct, non-duplicate record with a correct link
    3 |           2 | duplicate record with a broken link
    3 |           1 | duplicate record with a correct link
    4 |        null | duplicate record with a missing link
    4 |           1 | duplicate record with a correct link
    5 |           3 | duplicate record with a correct link, but a null category behind it
    5 |           1 | duplicate record with a correct link
    6 |        null | correct, non-duplicate record with a missing link
    7 |           2 | correct, non-duplicate record with a broken link
    8 |           3 | correct, non-duplicate record with a correct link, but a null category behind it

pgsql122=# select * from other_data;
 id | category
----+----------
  1 | a
  3 | null

  1. Both of your approaches keep uuid 1 and eliminate uuids 6, 7 and 8 even though they're unique.
  2. Your first approach arbitrarily keeps between 0 and 3 of the 3 duplicated uuids (3, 4 and 5), depending on which row in each pair gets discarded by DISTINCT ON.
  3. Your second approach always keeps one record for each of uuids 3, 4 and 5: every duplicate with a missing link, a broken link, or a link to a null category is already gone by the time you discard duplicates. (A setup script reproducing these tables follows below.)
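For reference, a minimal setup reproducing the tables above (column types are assumptions, the comment column is left out, and plain integers stand in for real uuids):

create table other_data (id int, category text);
insert into other_data values (1, 'a'), (3, null);

create table original_data (uuid int, foreign_key int);
insert into original_data values
(1, 1), (3, 2), (3, 1), (4, null), (4, 1),
(5, 3), (5, 1), (6, null), (7, 2), (8, 3);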

As @a_horse_with_no_name suggested, an ORDER BY will make DISTINCT ON consistent and predictable, but only as long as the records vary in the columns used for ordering. It also won't help if you have other issues, like the ones described above.
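Putting both together, a sketch of your first approach made deterministic and null-tolerant (ordering by date and keeping unmatched rows are assumptions about your intent; something is still a placeholder):

with tbl1 as (
select distinct on (uuid) uuid, foreign_key
from original_data
where date > some_date
order by uuid, date desc -- keep the newest row per uuid instead of an arbitrary one
)

select uuid
from tbl1
left join other_data
on tbl1.foreign_key = other_data.id
where other_data.category is distinct from something -- null-safe: also keeps missing links, broken links and null categories

IS DISTINCT FROM treats null like an ordinary value, so the rows where <> would evaluate to null survive the filter; if you actually do want those rows gone, your original <> already does that.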

Upvotes: 1
