Reputation: 87
I want to return unique question but even with DISTINCT if I have one question that has multiple answers all of the same question with the answers get return I only want that question return one time here is my sql
SELECT
DISTINCT questions.id AS question_id,
questions.title AS question_title,
questions.created_at AS questionCreatedAt,
questions.updated_at AS question_updated_at,
answers.id AS answer_id,
answers.content AS answer_content,
answers.created_at AS answer_created_at,
answers.updated_at AS answer_updated_at,
(SELECT SUM(votes.value) AS votes FROM votes WHERE answers.id =votes.answer_id)
FROM questions
LEFT JOIN answers ON questions.id = answers.question_id
LEFT JOIN votes ON answers.id = votes.answer_id;
Upvotes: 0
Views: 207
Reputation: 657002
To return a particular row from each group, you need to add a deterministic ORDER BY
clause. Basically:
SELECT DISTINCT ON (q.id)
q.id AS question_id
, q.title AS question_title
, q.created_at AS question_created_at
, q.updated_at AS question_updated_at
, a.id AS answer_id
, a.content AS answer_content
, a.created_at AS answer_created_at
, a.updated_at AS answer_updated_at
, SUM(v.value) AS votes
FROM questions q
LEFT JOIN answers a ON q.id = a.question_id
LEFT JOIN votes v ON a.id = v.answer_id
GROUP BY q.id, a.id -- for the sum
ORDER BY q.id, a.created_at DESC NULLS LAST, a.id;
The first ORDER BY
item must agree with the DISTINCT ON
clause.
You want the "latest" answer, so a.created_at DESC
is next.
NULLS LAST
because the column might be nullable (you did not disclose).
The final a.id
only serves a tiebreaker in case multiple answers tie on a.created_at
.
Detailed explanation:
After joining to votes
already, the correlated subquery for the sum of votes is not needed:
(SELECT SUM(votes.value) AS votes FROM votes WHERE answers.id =votes.answer_id)
Currently, you probably get incorrect (multiplied) sums. Assuming a one-to-many relationship between answers
and votes
(else, the vote count could just be added as another column to answers
), it's either-or: Either join to the table, then GROUP BY
, or do not join to the table and add that correlated subquery.
I fixed it with a plain sum()
keeping the join, assuming q.id
and a.id
are the respective primary keys of their tables (you did not disclose the table definition). That's possible because DISTINCT ON
is applied after GROUP BY
. See:
Or see below for a possibly better solution.
While you return all or most questions the query is typically faster if you join after getting the latest answer per question. Like:
SELECT q.id AS question_id
, q.title AS question_title
, q.created_at AS question_created_at
, q.updated_at AS question_updated_at
, a.id AS answer_id
, a.content AS answer_content
, a.created_at AS answer_created_at
, a.updated_at AS answer_updated_at
, u.user_name -- whatever you need from users table
, (SELECT SUM(value) FROM votes v WHERE v.answer_id = a.answer_id) AS votes
FROM questions q
LEFT JOIN (
SELECT DISTINCT ON (a.question_id)
a.question_id AS id
, a.id AS answer_id
, a.content AS answer_content
, a.created_at AS answer_created_at
, a.updated_at AS answer_updated_at
, a.user_id
FROM answers a
ORDER BY a.question_id, a.created_at DESC NULLS LAST, a.id
) a USING (id)
LEFT JOIN users u ON u.id = a.user_id
Here, I kept the correlated subquery for votes
because it's typically cheaper to do that after reducing to chosen answers instead of counting for all answers.
Similar for users
(added in your answer): join after reducing to the chosen answer. And put something from users
in the SELECT
list to actually return from it.
If your table answers
is big, a multicolumn index on answer(question_id, created_at DESC NULLS LAST)
would be ideal for performance.
If there are many answers per question, a different query technique may be faster. See:
For retrieving a small percentage of all questions, LATERAL
or correlated subqueries are typically faster.
Details depend on undisclosed table definitions and cardinalities.
Upvotes: 1
Reputation: 35593
select distinct
is a row operator
this means that it inspects every selected column and considers if that row is different to every other row. If the current row is different from any other row due to any of the selected columns, it will be returned. In your case you have joined questions to answers and I assume that as soon as you have more than one person in a survey then you will potentially get different answers, and hence those differences result in more rows. If you just need distinct questions then don't also join to tables that will multiply the results.
However in your query you also appear to want a SUM()
so instead of pursuing the use of select distinct
perhaps you could consider use of group by
instead, like this:
SELECT
questions.id AS question_id,
questions.title AS question_title,
questions.created_at AS questionCreatedAt,
questions.updated_at AS question_updated_at,
COUNT(DISTINCT answers.id) and num_answers,
MIN(answers.created_at) AS answer_created_at,
MAX(answers.updated_at) AS answer_updated_at,
SUM(votes.value) AS votes
FROM questions
LEFT JOIN answers
ON questions.id = answers.question_id
LEFT JOIN votes
ON answers.id = votes.answer_id
GROUP BY
questions.id,
questions.title,
questions.created_at,
questions.updated_at
In Postgres there is additional qualifier for distinct being select distinct on (...)
but to control the result it should be used in conjunction with an order by clause
The DISTINCT ON expressions are interpreted using the same rules as for ORDER BY (see above). Note that the "first row" of each set is unpredictable unless ORDER BY is used to ensure that the desired row appears first. (ref)
So this will reduce the number of returned rows, and can be controlled to output the most recently created answer using an order by
clause
SELECT DISTINCT ON (questions.id)
questions.id,
questions.title AS question_title,
questions.created_at AS questionCreatedAt,
questions.updated_at AS question_updated_at,
answers.id AS answer_id,
answers.content AS answer_content,
answers.created_at AS answer_created_at,
answers.updated_at AS answer_updated_at,
(SELECT SUM(votes.value) AS votes FROM votes WHERE answers.id =votes.answer_id)
FROM questions
LEFT JOIN answers ON questions.id = answers.question_id
LEFT JOIN votes ON answers.id = votes.answer_id;
ORDER BY questions.id, answers.created_at DESC
A more generic (not Postgres specific) solution to finding "the latest row" is to use row_number() over()
. In addition instead of summing through votes via a "correlated subquery" summarise the votes by answer first, then join to the remaining tables, like this:
SELECT
questions.id AS question_id,
questions.title AS question_title,
questions.created_at AS questionCreatedAt,
questions.updated_at AS question_updated_at
answers.id AS answer_id,
answers.content AS answer_content,
answers.created_at AS answer_created_at,
answers.updated_at AS answer_updated_at,
votes.votes
FROM questions
LEFT JOIN ( SELECT
a.*,
ROW_NUMBER() OVER (PARTITION BY a.id ORDER BY created_at DESC) AS rn
FROM answers AS a
) AS answers
ON questions.id = answers.question_id
AND answers.rn = 1
LEFT JOIN (
SELECT v.answer_id, SUM(v.value) as votes
FROM votes as v
GROUP BY v.answer_id
) AS votes
ON answers.id = votes.answer_id
Upvotes: 0
Reputation: 87
Got it to work in getting the lastest answer
`SELECT
DISTINCT ON (questions.id) questions.id,
questions.title AS question_title,
questions.created_at AS questionCreatedAt,
questions.updated_at AS question_updated_at,
answers.id AS answer_id,
answers.content AS answer_content,
answers.created_at AS answer_created_at,
answers.updated_at AS answer_updated_at,
(SELECT SUM(votes.value) AS votes FROM votes WHERE answers.id =votes.answer_id)
FROM questions
LEFT JOIN answers ON questions.id = answers.question_id
LEFT JOIN users ON users.id = answers.user_id
LEFT JOIN votes ON answers.id = votes.answer_id
ORDER BY questions.id, answers.created_at DESC NULLS LAST, answers.id;
Upvotes: 0
Reputation: 71
You should use "DISTINCT ON" instead of "DISTINCT".
SELECT
DISTINCT ON (questions.id) questions.id,
questions.title AS question_title,
questions.created_at AS questionCreatedAt,
questions.updated_at AS question_updated_at,
answers.id AS answer_id,
answers.content AS answer_content,
answers.created_at AS answer_created_at,
answers.updated_at AS answer_updated_at,
(SELECT SUM(votes.value) AS votes FROM votes WHERE answers.id =votes.answer_id)
FROM questions
LEFT JOIN answers ON questions.id = answers.question_id
LEFT JOIN votes ON answers.id = votes.answer_id;
Upvotes: 0