sqlpostgresqlaggregategreatest-n-per-group

Reputation: 87

Returning unique items in postgres

I want to return unique question but even with DISTINCT if I have one question that has multiple answers all of the same question with the answers get return I only want that question return one time here is my sql

SELECT 
DISTINCT questions.id AS question_id,
questions.title AS question_title,
questions.created_at AS questionCreatedAt,
questions.updated_at AS question_updated_at,
answers.id AS answer_id,
answers.content AS answer_content,
answers.created_at AS answer_created_at,
answers.updated_at AS answer_updated_at,
(SELECT SUM(votes.value) AS votes FROM votes WHERE answers.id =votes.answer_id)
FROM questions
LEFT JOIN answers ON questions.id = answers.question_id
LEFT JOIN votes ON answers.id = votes.answer_id;

Upvotes: 0

Answers (4)

Erwin Brandstetter

Reputation: 657002

To return a particular row from each group, you need to add a deterministic ORDER BY clause. Basically:

SELECT DISTINCT ON (q.id)
       q.id AS question_id
     , q.title AS question_title
     , q.created_at AS question_created_at
     , q.updated_at AS question_updated_at
     , a.id AS answer_id
     , a.content AS answer_content
     , a.created_at AS answer_created_at
     , a.updated_at AS answer_updated_at
     , SUM(v.value) AS votes
FROM   questions q
LEFT   JOIN answers a ON q.id = a.question_id
LEFT   JOIN votes   v ON a.id = v.answer_id
GROUP  BY q.id, a.id   -- for the sum
ORDER  BY q.id, a.created_at DESC NULLS LAST, a.id;

The first ORDER BY item must agree with the DISTINCT ON clause.
You want the "latest" answer, so a.created_at DESC is next.
NULLS LAST because the column might be nullable (you did not disclose).
The final a.id only serves a tiebreaker in case multiple answers tie on a.created_at.

Detailed explanation:

Select first row in each GROUP BY group?

After joining to votes already, the correlated subquery for the sum of votes is not needed:

(SELECT SUM(votes.value) AS votes FROM votes WHERE answers.id =votes.answer_id)

Currently, you probably get incorrect (multiplied) sums. Assuming a one-to-many relationship between answers and votes (else, the vote count could just be added as another column to answers), it's either-or: Either join to the table, then GROUP BY, or do not join to the table and add that correlated subquery.

I fixed it with a plain sum() keeping the join, assuming q.id and a.id are the respective primary keys of their tables (you did not disclose the table definition). That's possible because DISTINCT ON is applied after GROUP BY. See:

Best way to get result count before LIMIT was applied

Or see below for a possibly better solution.

While you return all or most questions the query is typically faster if you join after getting the latest answer per question. Like:

SELECT q.id AS question_id
     , q.title AS question_title
     , q.created_at AS question_created_at
     , q.updated_at AS question_updated_at
     , a.id AS answer_id
     , a.content AS answer_content
     , a.created_at AS answer_created_at
     , a.updated_at AS answer_updated_at
     , u.user_name                -- whatever you need from users table
     , (SELECT SUM(value) FROM votes v WHERE v.answer_id = a.answer_id) AS votes
FROM   questions q
LEFT   JOIN (
   SELECT DISTINCT ON (a.question_id)
          a.question_id AS id
        , a.id AS answer_id
        , a.content AS answer_content
        , a.created_at AS answer_created_at
        , a.updated_at AS answer_updated_at
        , a.user_id
   FROM   answers    a
   ORDER  BY a.question_id, a.created_at DESC NULLS LAST, a.id
   ) a USING (id)
LEFT JOIN users u ON u.id = a.user_id

Here, I kept the correlated subquery for votes because it's typically cheaper to do that after reducing to chosen answers instead of counting for all answers.

Similar for users (added in your answer): join after reducing to the chosen answer. And put something from users in the SELECT list to actually return from it.

If your table answers is big, a multicolumn index on answer(question_id, created_at DESC NULLS LAST) would be ideal for performance.
If there are many answers per question, a different query technique may be faster. See:

Optimize GROUP BY query to retrieve latest row per user

For retrieving a small percentage of all questions, LATERAL or correlated subqueries are typically faster.

Details depend on undisclosed table definitions and cardinalities.

Upvotes: 1

Paul Maxwell

Reputation: 35593

select distinct is a row operator this means that it inspects every selected column and considers if that row is different to every other row. If the current row is different from any other row due to any of the selected columns, it will be returned. In your case you have joined questions to answers and I assume that as soon as you have more than one person in a survey then you will potentially get different answers, and hence those differences result in more rows. If you just need distinct questions then don't also join to tables that will multiply the results.

However in your query you also appear to want a SUM() so instead of pursuing the use of select distinct perhaps you could consider use of group by instead, like this:

SELECT
  questions.id AS question_id,
  questions.title AS question_title,
  questions.created_at AS questionCreatedAt,
  questions.updated_at AS question_updated_at,
  COUNT(DISTINCT answers.id) and num_answers,
  MIN(answers.created_at) AS answer_created_at,
  MAX(answers.updated_at) AS answer_updated_at,
  SUM(votes.value) AS votes
FROM questions
LEFT JOIN answers
  ON questions.id = answers.question_id
LEFT JOIN votes
  ON answers.id = votes.answer_id
GROUP BY
  questions.id,
  questions.title,
  questions.created_at,
  questions.updated_at

In Postgres there is additional qualifier for distinct being select distinct on (...) but to control the result it should be used in conjunction with an order by clause

The DISTINCT ON expressions are interpreted using the same rules as for ORDER BY (see above). Note that the "first row" of each set is unpredictable unless ORDER BY is used to ensure that the desired row appears first. (ref)

So this will reduce the number of returned rows, and can be controlled to output the most recently created answer using an order by clause

SELECT DISTINCT ON (questions.id)
 questions.id,
questions.title AS question_title,
questions.created_at AS questionCreatedAt,
questions.updated_at AS question_updated_at,
answers.id AS answer_id,
answers.content AS answer_content,
answers.created_at AS answer_created_at,
answers.updated_at AS answer_updated_at,
(SELECT SUM(votes.value) AS votes FROM votes WHERE answers.id =votes.answer_id)
FROM questions
LEFT JOIN answers ON questions.id = answers.question_id
LEFT JOIN votes ON answers.id = votes.answer_id; 
ORDER BY questions.id, answers.created_at DESC

A more generic (not Postgres specific) solution to finding "the latest row" is to use row_number() over(). In addition instead of summing through votes via a "correlated subquery" summarise the votes by answer first, then join to the remaining tables, like this:

SELECT
  questions.id AS question_id,
  questions.title AS question_title,
  questions.created_at AS questionCreatedAt,
  questions.updated_at AS question_updated_at
  answers.id AS answer_id,
  answers.content AS answer_content,
  answers.created_at AS answer_created_at,
  answers.updated_at AS answer_updated_at,
  votes.votes
FROM questions
LEFT JOIN ( SELECT
            a.*,
            ROW_NUMBER() OVER (PARTITION BY a.id ORDER BY created_at DESC) AS rn
            FROM answers AS a
  ) AS answers
  ON questions.id = answers.question_id
  AND answers.rn = 1
LEFT JOIN (
            SELECT v.answer_id, SUM(v.value) as votes
            FROM votes as v
            GROUP BY v.answer_id
  ) AS votes
  ON answers.id = votes.answer_id

Upvotes: 0

sebas tian

Reputation: 87

Got it to work in getting the lastest answer

`SELECT
            DISTINCT ON (questions.id) questions.id,
            questions.title AS question_title,
            questions.created_at AS questionCreatedAt,
            questions.updated_at AS question_updated_at,
            answers.id AS answer_id,
            answers.content AS answer_content,
            answers.created_at AS answer_created_at,
            answers.updated_at AS answer_updated_at,
            (SELECT SUM(votes.value) AS votes FROM votes WHERE answers.id =votes.answer_id)
            FROM questions
            LEFT JOIN answers ON questions.id = answers.question_id
            LEFT JOIN users ON  users.id  = answers.user_id
            LEFT JOIN votes ON answers.id = votes.answer_id
            ORDER  BY questions.id, answers.created_at DESC NULLS LAST, answers.id;

Upvotes: 0

Viktor Shvachuk

Reputation: 71

You should use "DISTINCT ON" instead of "DISTINCT".

SELECT 
DISTINCT ON (questions.id) questions.id,
questions.title AS question_title,
questions.created_at AS questionCreatedAt,
questions.updated_at AS question_updated_at,
answers.id AS answer_id,
answers.content AS answer_content,
answers.created_at AS answer_created_at,
answers.updated_at AS answer_updated_at,
(SELECT SUM(votes.value) AS votes FROM votes WHERE answers.id =votes.answer_id)
FROM questions
LEFT JOIN answers ON questions.id = answers.question_id
LEFT JOIN votes ON answers.id = votes.answer_id;

Returning unique items in postgres

Answers (4)

Related Questions