Select first or random row in group by

Question

I have this query using PostgreSQL 9.1 (9.2 as soon as our hosting platform upgrades):

SELECT
    media_files.album,
    media_files.artist,
    ARRAY_AGG(media_files.id) AS media_file_ids
FROM
    media_files
INNER JOIN playlist_media_files ON media_files.id = playlist_media_files.media_file_id
WHERE
    playlist_media_files.playlist_id = 1
GROUP BY
    media_files.album,
    media_files.artist
ORDER BY
    media_files.album ASC

It works fine. The goal is to extract album/artist combinations and, for each combination, return an array of the media file IDs that belong to it.

The problem is that I have another column in media files, which is artwork.

artwork is unique for each media file (even in the same album) but in the result set I need to return just the first of the set.

So, for an album that has 10 media files, I also have 10 corresponding artworks, but I would like to return just the first one (or a randomly picked one from that collection).

Is that possible to do with only SQL/Window Functions (first_value over..)?

Craig Ringer · Accepted Answer

Yes, it's possible. First, let's tweak your query by adding aliases and explicit column qualifiers so it's clear what comes from where - assuming I've guessed correctly, since I can't be sure without table definitions:

SELECT
    mf.album,
    mf.artist,
    ARRAY_AGG (mf.id) AS media_file_ids
FROM
    "media_files" mf
INNER JOIN "playlist_media_files" pmf ON mf.id = pmf.media_file_id
WHERE
    pmf.playlist_id = 1
GROUP BY
    mf.album,
    mf.artist
ORDER BY
    mf.album ASC

Now you can either use a subquery in the SELECT list or maybe use DISTINCT ON, though it looks like any solution based on DISTINCT ON will be so convoluted as not to be worth it.

What you really want is something like a pick_arbitrary_value_agg aggregate that just picks the first value it sees and throws the rest away. There is no such aggregate, and it isn't really worth implementing one just for this job. You could use min(artwork) or max(artwork) instead, and you may find that this actually performs better than the solutions below.
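As a minimal sketch of the min()/max() idea, the aggregate can simply be added to the SELECT list of the query above. This assumes artwork is a type with a defined ordering (for example text holding a path or URL), which is a guess since the table definition isn't shown. The alias picked_artwork is only illustrative.

```sql
-- Sketch only: assumes media_files.artwork is an orderable type such as text.
SELECT
    mf.album,
    mf.artist,
    array_agg(mf.id) AS media_file_ids,
    -- min() returns the smallest artwork value in the group,
    -- which is an arbitrary but repeatable pick, not the "first" row.
    min(mf.artwork) AS picked_artwork
FROM
    media_files mf
INNER JOIN playlist_media_files pmf ON mf.id = pmf.media_file_id
WHERE
    pmf.playlist_id = 1
GROUP BY
    mf.album,
    mf.artist
ORDER BY
    mf.album ASC;
```

The pick is stable between runs as long as the data doesn't change, which may be preferable to a truly random choice.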

To use a subquery, leave the ORDER BY as it is and add the following as an extra column in your SELECT list:

(SELECT mf2.artwork 
 FROM media_files mf2 
 WHERE mf2.artist = mf.artist
   AND mf2.album = mf.album
 LIMIT 1) AS picked_artwork

You can, at a performance cost, randomize the selected artwork by adding ORDER BY random() before the LIMIT 1 above.
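Put together, the subquery variant with a random pick might look like the sketch below. It is untested against real data and relies on the same guessed column names as the rest of this answer.

```sql
-- Sketch only: correlated subquery in the SELECT list, randomized per group.
SELECT
    mf.album,
    mf.artist,
    array_agg(mf.id) AS media_file_ids,
    (SELECT mf2.artwork
     FROM media_files mf2
     WHERE mf2.artist = mf.artist
       AND mf2.album = mf.album
     ORDER BY random()   -- drop this line to take whichever row comes first
     LIMIT 1) AS picked_artwork
FROM
    media_files mf
INNER JOIN playlist_media_files pmf ON mf.id = pmf.media_file_id
WHERE
    pmf.playlist_id = 1
GROUP BY
    mf.album,
    mf.artist
ORDER BY
    mf.album ASC;
```

Note that the subquery looks at every row in media_files with the same album and artist, not only the ones in playlist 1, so it can return artwork from a file that isn't in the playlist. Groups where album or artist is NULL will get a NULL artwork, because the equality comparison never matches NULL.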

Alternately, here's a quick and dirty way to implement selection of a random row in-line:

(array_agg(artwork))[width_bucket(random(),0,1,count(artwork)::integer)]
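For context, here is a sketch of how the array-indexing trick could sit in the full query, next to a deterministic "first" pick that uses array_agg with an ORDER BY inside the aggregate (supported since PostgreSQL 9.0). Ordering by mf.id is an assumption about what "first" should mean, and the aliases first_artwork and random_artwork are illustrative.

```sql
-- Sketch only: both picks come from the rows already in the group,
-- so they are restricted to media files in the selected playlist.
SELECT
    mf.album,
    mf.artist,
    array_agg(mf.id) AS media_file_ids,
    -- Deterministic pick: artwork of the lowest id in the group.
    (array_agg(mf.artwork ORDER BY mf.id))[1] AS first_artwork,
    -- Random pick: width_bucket maps random() in [0, 1) to an index 1..count(*).
    (array_agg(mf.artwork))[width_bucket(random(), 0, 1, count(*)::integer)] AS random_artwork
FROM
    media_files mf
INNER JOIN playlist_media_files pmf ON mf.id = pmf.media_file_id
WHERE
    pmf.playlist_id = 1
GROUP BY
    mf.album,
    mf.artist
ORDER BY
    mf.album ASC;
```

This sketch uses count(*) rather than count(artwork) so that the index range matches the array length even if some artwork values are NULL, since array_agg keeps NULLs while count(artwork) skips them.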

Since there's no sample data I can't test these modifications. Let me know if there's an issue.
