Reputation: 2675
I have a SQL Server query that I'm trying to convert to run in BigQuery. There are three tables involved:
CalendarMonths
FirstDayOfMonth | FirstDayOfNextMonth
----------------------------+----------------------------
2017-02-01 00:00:00.000 UTC | 2017-03-01 00:00:00.000 UTC
2017-03-01 00:00:00.000 UTC | 2017-04-01 00:00:00.000 UTC
Clients
clientid | name | etc.
---------+----------------+------
1 | Bob's Shop |
2 | Anne's Cookies |
ClientLogs
id | clientid | timestamp | price_current | price_old | license_count_current | license_count_old |
----+----------+----------------+---------------+-----------+-----------------------+---------------
1 | 1 | 2017-02-01 UTC | 1200 | 0 | 10 | 0 |
2 | 1 | 2018-02-03 UTC | 2400 | 1200 | 20 | 10 |
3 | 2 | 2016-07-13 UTC | 1200 | 0 | 10 | 0 |
4 | 2 | 2018-03-30 UTC | 0 | 1200 | 0 | 10 |
The T-SQL query looks something like this:
SELECT
FirstDayOfMonth, FirstDayOfNextMonth,
(SELECT SUM(sizeatdatelog.price_current)
FROM clients c
CROSS APPLY (SELECT TOP 1 *
FROM clientlogs
WHERE clientid = c.clientid
AND [timestamp] < cm.FirstDayOfMonth
ORDER BY [timestamp] DESC) sizeatdatelog
WHERE sizeatdatelog.license_count_current > 0) as StartingRevenue,
(another subquery for starting client count) as StartingClientCount,
(another subquery for churned revenue) as ChurnedRevenue,
(there are about 6 other subqueries)
FROM
CalendarMonths cm
ORDER BY
cm.FirstDayOfMonth
And the final output looks like:
FirstDayOfMonth | FirstDayOfNextMonth | StartingRevenue | StartingClientCount | etc
-------------------------------------------------------------------------------------------------------
2017-02-01 00:00:00.000 UTC | 2017-03-01 00:00:00.000 UTC | 68382995.43 | 79430 |
2017-03-01 00:00:00.000 UTC | 2017-04-01 00:00:00.000 UTC | 69843625.12 | 80430 |
In BigQuery, I added a simple subquery in the select clause and it worked great:
SELECT FirstDayOfMonth, FirstDayOfNextMonth, (SELECT clientId FROM clientlogs LIMIT 1 ) as cl
FROM CalendarMonths cm
ORDER BY cm.FirstDayOfMonth
However, as soon as I add a where clause to the subquery, I get this error message:
Error: Correlated subqueries that reference other tables are not supported unless they can be de-correlated, such as by transforming them into an efficient JOIN.
How should I proceed from this point? If I can't get the results I'm looking for in one query, maybe I should look into creating multiple scheduled jobs that create temporary tables and then a final scheduled job that joins it all together. Or maybe I could look at doing this in code via GCP or use the BigQuery API in app scripts. The data size isn't huge and the query isn't run often. I'm looking for maintainability more than efficiency, so ideally there is a way to get this data into one query.
Upvotes: 5
Views: 3168
Reputation: 2675
For the sake of the community I'm posting the query I ended up using. Huge thanks to Mikhail Berlyant for his help with this one.
I ended up breaking the query into CTEs so I could use correlated subqueries to get the specific data I needed.
WITH previousMonths AS (
SELECT *
FROM (
SELECT FirstDayOfMonth, FirstDayOfNextMonth, account_c,
FIRST_VALUE(acl.timestamp_c ) OVER (start_values) timestamp_c,
FIRST_VALUE(acl.acv_current_c ) OVER (start_values) acv_current_c,
FIRST_VALUE(acl.license_count_current_c) OVER(start_values) license_count_current_c,
FIRST_VALUE(acl.price_current_c) OVER (start_values) price_current_c
FROM warehouse.project.calendar_months cm
JOIN warehouse.project.account_change_logs acl ON timestamp_c < FirstDayOfMonth
WINDOW start_values AS (PARTITION BY account_c, FirstDayOfMonth ORDER BY timestamp_c DESC)
)
GROUP BY FirstDayOfMonth, FirstDayOfNextMonth, account_c,
timestamp_c, acv_current_c, license_count_current_c, price_current_c
),
currentMonth AS (
SELECT *
FROM (
SELECT FirstDayOfMonth, FirstDayOfNextMonth, account_c,
FIRST_VALUE(acl.timestamp_c ) OVER (change_values) timestamp_c,
FIRST_VALUE(acl.acv_current_c ) OVER (change_values) acv_current_c,
FIRST_VALUE(acl.license_count_current_c) OVER(change_values) license_count_current_c,
FIRST_VALUE(acl.acv_old_c) OVER(PARTITION BY account_c, FirstDayOfMonth ORDER BY timestamp_c) acv_old_at_start_of_month_c,
FIRST_VALUE(acl.license_count_old_c) OVER(PARTITION BY account_c, FirstDayOfMonth ORDER BY timestamp_c) license_count_old_at_start_of_month_c,
FIRST_VALUE(acl.price_current_c) OVER (change_values) price_current_c
FROM warehouse.project.calendar_months cm
JOIN warehouse.project.account_change_logs acl
ON timestamp_c >= FirstDayOfMonth AND timestamp_c < FirstDayOfNextMonth
WINDOW change_values AS (PARTITION BY account_c, FirstDayOfMonth ORDER BY timestamp_c DESC)
)
GROUP BY FirstDayOfMonth, FirstDayOfNextMonth, account_c,
timestamp_c, acv_current_c, acv_old_at_start_of_month_c, license_count_current_c,
license_count_old_at_start_of_month_c, price_current_c
)
SELECT FirstDayOfMonth, FirstDayOfNextMonth,
(SELECT COUNT(acv_current_c) FROM previousMonths pm WHERE pm.FirstDayOfMonth = cal.FirstDayOfMonth
AND license_count_current_c > 0) as StartingAccounts,
(SELECT COUNT(acv_current_c) FROM currentMonth cm WHERE cm.FirstDayOfMonth = cal.FirstDayOfMonth
AND license_count_old_at_start_of_month_c = 0 AND license_count_current_c > 0) as NewAccounts,
(SELECT COUNT(acv_current_c) FROM currentMonth cm WHERE cm.FirstDayOfMonth = cal.FirstDayOfMonth
AND license_count_current_c = 0) as ChurnAccounts,
(SELECT SUM(license_count_current_c) FROM previousMonths pm WHERE pm.FirstDayOfMonth = cal.FirstDayOfMonth
AND license_count_current_c > 0) as StartingUsers,
(SELECT SUM(license_count_current_c) FROM currentMonth cm WHERE cm.FirstDayOfMonth = cal.FirstDayOfMonth
AND license_count_old_at_start_of_month_c = 0 AND license_count_current_c > 0) as NewUsers,
(SELECT SUM(license_count_current_c - license_count_old_at_start_of_month_c) FROM currentMonth cm WHERE cm.FirstDayOfMonth = cal.FirstDayOfMonth
AND license_count_old_at_start_of_month_c < license_count_current_c
AND license_count_old_at_start_of_month_c <> 0) as ExpansionUsers,
(SELECT SUM(license_count_old_at_start_of_month_c - license_count_current_c) FROM currentMonth cm WHERE cm.FirstDayOfMonth = cal.FirstDayOfMonth
AND license_count_old_at_start_of_month_c > license_count_current_c
AND license_count_current_c <> 0) as ContractionUsers,
(SELECT SUM(license_count_old_at_start_of_month_c - license_count_current_c) FROM currentMonth cm WHERE cm.FirstDayOfMonth = cal.FirstDayOfMonth
AND license_count_old_at_start_of_month_c > license_count_current_c
AND license_count_current_c = 0) as ChurnUsers,
(SELECT SUM(acv_current_c) FROM previousMonths pm WHERE pm.FirstDayOfMonth = cal.FirstDayOfMonth
AND license_count_current_c > 0) as StartingARR
--etc, etc,
FROM warehouse.project.calendar_months cal
ORDER BY FirstDayOfMonth
Upvotes: 0
Reputation: 136
Correlated subquery like
SELECT TOP 1 * FROM clientlogs WHERE clientid = c.clientid AND [timestamp] < cm.FirstDayOfMonth ORDER BY [timestamp] DESC)
in BigQuery usually needs to be rewritten through aggregation along the lines of
SELECT ARRAY_AGG(foo ORDER BY [timestamp] DESC LIMIT 1)[offset(0)] FROM ... as foo WHERE correlated condition
BigQuery more likely to work with simple correlated subqueries in the form of
SELECT {optional aggregation} FROM table WHERE {correlated condition}
Upvotes: 0
Reputation: 172974
Below is for BigQuery Standard SQL
#standardSQL
SELECT FirstDayOfMonth, FirstDayOfNextMonth,
SUM(price_current) StartingRevenue, COUNT(1) StartingClientCount
FROM (
SELECT FirstDayOfMonth, FirstDayOfNextMonth,
clientid, price_current
FROM (
SELECT FirstDayOfMonth, FirstDayOfNextMonth, clientid,
FIRST_VALUE(price_current) OVER(latest_values) price_current,
FIRST_VALUE(license_count_current) OVER(latest_values) license_count_current
FROM `project.dataset.CalendarMonths` cm
JOIN `project.dataset.ClientLogs` cl
ON `timestamp` < FirstDayOfMonth
WINDOW latest_values AS (PARTITION BY clientid ORDER BY `timestamp` DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
)
WHERE license_count_current > 0
GROUP BY FirstDayOfMonth, FirstDayOfNextMonth, clientid, price_current
)
GROUP BY FirstDayOfMonth, FirstDayOfNextMonth
ORDER BY FirstDayOfMonth
most likely above can be extended to the rest of your subqueries
Upvotes: 3