nnarefun
nnarefun

Reputation: 99

Unexpected result when using DISTINCT ON and COUNT() with ORDER BY

PostgreSQL query returns different number of rows and votes number, produced by COUNT() function.

In vote_table posts can be upvoted, then same voter can undo it's vote but same voter can't vote twice:

 post_id | voter_id | is_upvote | timestamp
---------+----------+-----------+-----------
     440 |       28 | f         | timestamp
     440 |      497 | f         | timestamp
     440 |      959 | t         | timestamp
     440 |      959 | f         | timestamp
     440 |      959 | t         | timestamp
     440 |      959 | f         | timestamp
     440 |      959 | t         | timestamp

In this query I ask for last state of the vote using DISTINCT ON:

SELECT
  post_id, voter_id, is_upvote
FROM (
  SELECT DISTINCT ON (voter_id)
    post_id, voter_id, is_upvote
  FROM (
    SELECT post_id, voter_id, is_upvote
    FROM vote_table
    ORDER BY timestamp DESC
  ) votes_ordered
) votes_distinct
WHERE is_upvote IS TRUE AND post_id = 5;

Let's say the above query returns some number of rows where post 5 was upvoted. Then I ask to count number of votes for each post

SELECT
  post_id,
  COUNT(*) AS votes_count
FROM (
  SELECT
    post_id, voter_id, is_upvote
  FROM (
    SELECT DISTINCT ON (voter_id)
      post_id, voter_id, is_upvote
    FROM (
      SELECT post_id, voter_id, is_upvote
      FROM vote_table
      ORDER BY timestamp DESC
    ) votes_ordered
  ) votes_distinct
  WHERE is_upvote IS TRUE;
) votes_approve
GROUP BY post_id
;

In the resulting table, result is incorrect, votes_count will be lower than number of rows, returned by first query. If I re-add the AND post_id = 5 to second query, votes_count is correct

What is happening here?

UPDATE: As suggested by people who responded, I'm adding an example table. Only difference is, the table has id instead of timestamp but it's also sequential

CREATE TEMP TABLE vote_table ( id SERIAL PRIMARY KEY, post_id INT NOT NULL, voter_id INT NOT NULL, is_upvote BOOLEAN NOT NULL );

INSERT INTO vote_table (post_id, voter_id, is_upvote)
VALUES (440, 28, TRUE);
INSERT INTO vote_table (post_id, voter_id, is_upvote)
VALUES (440, 497, TRUE);
INSERT INTO vote_table (post_id, voter_id, is_upvote)
VALUES (440, 959, TRUE);
INSERT INTO vote_table (post_id, voter_id, is_upvote)
VALUES (440, 959, FALSE);
INSERT INTO vote_table (post_id, voter_id, is_upvote)
VALUES (440, 959, TRUE);
INSERT INTO vote_table (post_id, voter_id, is_upvote)
VALUES (440, 959, FALSE);
INSERT INTO vote_table (post_id, voter_id, is_upvote)
VALUES (440, 497, FALSE);
INSERT INTO vote_table (post_id, voter_id, is_upvote)
VALUES (440, 959, TRUE);

For this table, the second query will return correct result (440, 2) but when more entries are added to table, query result can be wrong because ORDER BY can be ignored, as said by Greg Pavlik

The updated question would be: how to write a similar query for selecting number of upvotes for each post, knowing that single voter can only upvote once and can downvote?

Upvotes: 0

Views: 136

Answers (1)

Greg Pavlik
Greg Pavlik

Reputation: 11056

The problem (or at least a problem) is right here:

  SELECT DISTINCT ON (voter_id)
    post_id, voter_id, is_upvote
  FROM (
    SELECT post_id, voter_id, is_upvote
    FROM vote_table
    ORDER BY timestamp DESC

This is a subquery. Think about what happens if you were to create a table based on this. The DBMS is free to store and return the rows any way it wants when selecting from that table. The same holds true for subqueries.

According to the SQL specification, a DMBS does not need to respect an ORDER BY clause in a subquery, only the final result set or places like window partitions and within group expressions. This order by clause here is in a subquery, so there's no guarantee PostgreSQL will maintain that order. In fact, as a performance optimization some DBMS ignore an order by clause in a subquery entirely.

If you need to use the ordering of rows to select specific rows, you need to use an order by in a window partition, within group expression, etc.

If you can update your question with a bit more detail, specifically a table with expected output would be helpful, someone here will surely help out with an applicable window function or other approach.

Upvotes: 3

Related Questions