aristofun
aristofun

Reputation: 375

Fetch latest rows grouped by uniqe field value

I have a table of Books with an author_id field.

I'd like to fetch an array of Books which contain only one Book of every author. The one with the latest updated_at field.

The problem with straightforward approach like Books.all.group('author_id') on Postgres is that it needs all requested field in its GROUP BY block. (See https://stackoverflow.com/a/6106195/1245302)

But I need to get all Book objects one per author, the recent one, ignoring all other fields. It seems to me that there's enough data for the DBMS to find exactly the rows I want, at least I could do that myself without any other fields in GROUP BY block. :)

Is there any simple Rails 3 + Postgres (version < 9) or SQL implementation independent way to get that?

UPDATE Nice solution for Postgres:

books.unscoped.select('DISTINCT ON(author_id) *').order('author_id').order('updated_at DESC') 

BUT! there still problem remains – results are sorted by author_id in the first place, but i need to sort by updated_at inside the same author_id-s (to find, say the top-10 recent book authors).

And Postgres doesn't allow you to change order of ORDER BY arguments in DISTINCT queries :(

Upvotes: 3

Views: 1018

Answers (2)

RachelS
RachelS

Reputation: 68

This is belated, but in response to questions about overriding/resetting a default order, use .reorder(nil).order(:whatever_you_want_instead)

(I can't comment, so posting as an answer for now)

Upvotes: 0

kgrittn
kgrittn

Reputation: 19511

I don't know Rails, but hopefully showing you the SQL for what you want will help get you to a way to generate the right SQL.

SELECT DISTINCT ON (author_id) *
  FROM Books
  ORDER BY author_id, updated_at DESC;

The DISTINCT ON (author_id) portion should not be confused with part of the result column list -- it just says that there will be one row per author_id. The list in a DISTINCT ON clause must be the leading portion of the ORDER BY clause in such a query, and the row which is kept is the one which sorts first based on the rest of the ORDER BY clause.

With a large number of rows this way of writing the query is usually much faster than any solution based on GROUP BY or window functions, often by an order of magnitude or more. It is a PostgreSQL extension, though; so it should not be used in code which is intended to be portable.

If you want to use this result set inside another query (for example, to find the 10 most recently updated authors), there are two ways to do that. You can use a subquery, like this:

SELECT *
  FROM (SELECT DISTINCT ON (author_id) *
          FROM Books
          ORDER BY author_id, updated_at DESC) w
  ORDER BY updated_at DESC
  LIMIT 10;

You could also use a CTE, like this:

WITH w AS (
  SELECT DISTINCT ON (author_id) *
    FROM Books
    ORDER BY author_id, updated_at DESC)
SELECT * FROM w
  ORDER BY updated_at DESC
  LIMIT 10;

The usual advice about CTEs holds here: use them only where there isn't another way to write the query or if needed to coerce the planner by introducing an optimization barrier. The plans are very similar, but passing the intermediate results through the CTE scan adds a little overhead. On my small test set the CTE form is 17% slower.

Upvotes: 1

Related Questions