David542
David542

Reputation: 110227

LIMITING GROUP BYs in BigQuery

I am trying to get the top 3 providers by revenue, and for each of those providers, the top 2 titles for that provider alphabetically. Here is the basic GROUP BY query I have to start with --

SELECT
  provider,
  title,
  SUM(customer_price)
FROM
  [integrated-myth-156821:fintest.m10]
GROUP BY
  title,
  provider
LIMIT
  25

Here is the result set I get:

enter image description here

And the table structure is provider | title | customer_price.

Here is an illustration of the result I am looking to get:

provider        title              sum(revenue)
Sony                               10000 // Top provider by revenue, limit 3
Sony            Avatar             400 // First title alphabetically, limit 2
Sony            Titanic            300
Fox                                8000
Fox             Fresh              12
Fox             Mike               800
WB                                 4500
WB              Harry              1000
WB              Potter             1200

Upvotes: 2

Views: 3403

Answers (2)

Mikhail Berlyant
Mikhail Berlyant

Reputation: 172993

Below solution is with BigQuery Standard SQL

#standardSQL
WITH topProviders AS (
  SELECT provider, SUM(customer_price) AS revenue
  FROM `integrated-myth-156821.fintest.m10`
  GROUP BY provider
  ORDER BY revenue DESC 
  LIMIT 3
),
titles AS (
  SELECT provider, title, SUM(customer_price) AS revenue
  FROM `integrated-myth-156821.fintest.m10`
  GROUP BY provider, title
)
SELECT provider, title, revenue
FROM (
  SELECT p.revenue AS sort,
    STRUCT(p.provider, '' AS title, p.revenue ) AS providers, 
    ARRAY_AGG(STRUCT(t.provider, title, t.revenue) ORDER BY title LIMIT 2) AS titles
  FROM topProviders AS p 
  JOIN titles AS t 
  ON p.provider = t.provider
  GROUP BY p.provider, p.revenue 
), UNNEST(ARRAY_CONCAT([providers], titles)) 
ORDER BY sort DESC, provider, title

Upvotes: 2

Mikhail Berlyant
Mikhail Berlyant

Reputation: 172993

Try below

#legacySQL
SELECT
  provider, title, revenue
FROM  
-- Top 2 (alphabetically) Titles for each of Top 3 Providers 
( 
SELECT
  titles.provider AS provider,
  titles.title AS title,
  titles.revenue AS revenue,
  top_providers.revenue AS pos
FROM (
  SELECT provider, title, revenue, ROW_NUMBER() OVER(PARTITION BY provider ORDER BY title) AS pos 
  FROM (
    SELECT provider, title, SUM(customer_price) AS revenue 
    FROM [integrated-myth-156821:fintest.m10]
    GROUP BY provider, title
  )
) AS titles
JOIN (
  SELECT
    provider,
    SUM(customer_price) AS revenue
  FROM [integrated-myth-156821:fintest.m10]
  GROUP BY provider
  ORDER BY revenue DESC  
  LIMIT 3 -- sets top 3 providers
) top_providers 
ON top_providers.provider = titles.provider
WHERE pos < 3 -- set top two titles
),
-- Top 3 Providers by Revenue
( 
  SELECT
    provider,
    '' AS title,
    SUM(customer_price) AS revenue,
    SUM(customer_price) AS pos
  FROM [integrated-myth-156821:fintest.m10]
  GROUP BY provider
  ORDER BY revenue DESC  
  LIMIT 3 -- sets top 3 providers
)
ORDER BY pos DESC, provider, title  

could you please explain some of the query?

This query is actually a union of two queries
Second (on the very bottom of query) provides top 3 by revenue providers

The first one gives top 2 titles for each provider in top 3 - and by itself is join of top 3 providers with titles revenues

pos in below expression

ROW_NUMBER() OVER(PARTITION BY provider ORDER BY title) AS pos     

allows then to filter in only 2 top titles with

WHERE pos < 3  

Upvotes: 1

Related Questions