LittleBobbyTables
LittleBobbyTables

Reputation: 4473

Window function to select the name of the majority row in Redshift Postgres

I have a dataset like this where some rows are useful, but corrupted.

create table pages (
  page varchar,
  cat varchar,
  hits int
);

insert into pages values
(1, 'asdf', 1),
(1, 'fdsa', 2),
(1, 'Apples', 321),
(2, 'gwegr', 30),
(2, 'hsgsdf', 2),
(2, 'Bananas', 321);

I want to know the correct category for each page, and the total hits. The correct category is the one with the most hits. I'd like to have a dataset like:

page | category | sum_of_hits
-----------------------------
1    | Apples   |  324
2    | Bananas  |  353

The furthest I can get is:

SELECT page,
       last_value(cat) over (partition BY page ORDER BY hits) as category,
       sum(hits) as sum_of_hits
FROM pages
GROUP BY 1, 2

But it is erroring: ERROR: column "pages.hits" must appear in the GROUP BY clause or be used in an aggregate function Position: 83.

I tried putting the hits in an aggregate - ORDER BY max(hits) but that doesn't make sense and isn't what I want.

Fiddle: http://sqlfiddle.com/#!17/cb3c2/17

Upvotes: 3

Views: 355

Answers (4)

Gordon Linoff
Gordon Linoff

Reputation: 1270713

You seem to want to take the sum of hits and the maximum. This is easy enough:

select page, sum(hits) as total_hits,
       max(case when seqnum = 1 then category end) as category
from (select p.*,
             row_number() over (partition by page order by hits desc) as seqnum
      from pages p
     ) p
group by page;

Upvotes: 0

cowbert
cowbert

Reputation: 3463

Use a subquery:

select page, cat, hits from
    (select page, cat, hits 
     ,max(hits) over (partition by page) as m_hits 
     from pages) t
where m_hits = hits

Upvotes: 0

cha
cha

Reputation: 10411

There are two problems here:

First is the usage of the last_value. Read a note in the Postgres documentation about the last value:

Note that first_value, last_value, and nth_value consider only the rows within the "window frame", which by default contains the rows from the start of the partition through the last peer of the current row. This is likely to give unhelpful results for nth_value and particularly last_value. You can redefine the frame as being the whole partition by adding ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING to the OVER clause. See Section 4.2.8 for more information.

I recommend you convert it to first_value:

SELECT page,
       first_value(cat) over (partition BY page ORDER BY hits DESC) as category,
       hits
FROM pages

The second problem is that you can't use the window function directly in the GROUP BY clause, you need to use a subquery or a cte:

select page, category,
       sum(hits)
from (
SELECT page,
       first_value(cat) over (partition BY page ORDER BY hits DESC) as category,
       hits
FROM pages
) a
GROUP BY 1, 2

SQL Fiddle

Upvotes: 2

klin
klin

Reputation: 121804

Use the window function first_value() over the reversed order of hits in a derived table (a subquery in FROM clause):

select 
    page, 
    category,
    sum(hits) as sum_of_hits
from (
    select
        page,
        first_value(cat) over (partition by page order by hits desc) as category,
        hits
    from pages
    ) s
group by 1, 2
order by 1;

SqlFiddle.

Upvotes: 1

Related Questions