Jonathan McIntire
Jonathan McIntire

Reputation: 2675

Getting around BigQuery subquery & apply limitations

I have a SQL Server query that I'm trying to convert to run in BigQuery. There are three tables involved:

CalendarMonths

     FirstDayOfMonth        |    FirstDayOfNextMonth
----------------------------+----------------------------
2017-02-01 00:00:00.000 UTC | 2017-03-01 00:00:00.000 UTC
2017-03-01 00:00:00.000 UTC | 2017-04-01 00:00:00.000 UTC

Clients

clientid |     name       | etc.
---------+----------------+------
1        |  Bob's Shop    |
2        | Anne's Cookies |

ClientLogs

 id | clientid |   timestamp    | price_current | price_old | license_count_current | license_count_old |
----+----------+----------------+---------------+-----------+-----------------------+---------------
1   |     1    | 2017-02-01 UTC |      1200     |     0     |          10           |          0        |
2   |     1    | 2018-02-03 UTC |      2400     |    1200   |          20           |         10        |
3   |     2    | 2016-07-13 UTC |      1200     |     0     |          10           |          0        |
4   |     2    | 2018-03-30 UTC |       0       |    1200   |           0           |         10        |

The T-SQL query looks something like this:

SELECT 
    FirstDayOfMonth, FirstDayOfNextMonth,
    (SELECT SUM(sizeatdatelog.price_current) 
     FROM clients c
     CROSS APPLY (SELECT TOP 1 * 
                  FROM clientlogs 
                  WHERE clientid = c.clientid 
                    AND [timestamp] < cm.FirstDayOfMonth 
                  ORDER BY [timestamp] DESC) sizeatdatelog
     WHERE sizeatdatelog.license_count_current > 0) as StartingRevenue,
    (another subquery for starting client count) as StartingClientCount,
    (another subquery for churned revenue) as ChurnedRevenue,
    (there are about 6 other subqueries)
FROM 
    CalendarMonths cm
ORDER BY 
    cm.FirstDayOfMonth

And the final output looks like:

     FirstDayOfMonth        |    FirstDayOfNextMonth      | StartingRevenue | StartingClientCount | etc
-------------------------------------------------------------------------------------------------------
2017-02-01 00:00:00.000 UTC | 2017-03-01 00:00:00.000 UTC |   68382995.43   |        79430        |
2017-03-01 00:00:00.000 UTC | 2017-04-01 00:00:00.000 UTC |   69843625.12   |        80430        |

In BigQuery, I added a simple subquery in the select clause and it worked great:

SELECT FirstDayOfMonth, FirstDayOfNextMonth, (SELECT clientId FROM clientlogs LIMIT 1 ) as cl 
FROM CalendarMonths cm
ORDER BY cm.FirstDayOfMonth

However, as soon as I add a where clause to the subquery, I get this error message:

Error: Correlated subqueries that reference other tables are not supported unless they can be de-correlated, such as by transforming them into an efficient JOIN.

How should I proceed from this point? If I can't get the results I'm looking for in one query, maybe I should look into creating multiple scheduled jobs that create temporary tables and then a final scheduled job that joins it all together. Or maybe I could look at doing this in code via GCP or use the BigQuery API in app scripts. The data size isn't huge and the query isn't run often. I'm looking for maintainability more than efficiency, so ideally there is a way to get this data into one query.

Upvotes: 5

Views: 3168

Answers (3)

Jonathan McIntire
Jonathan McIntire

Reputation: 2675

For the sake of the community I'm posting the query I ended up using. Huge thanks to Mikhail Berlyant for his help with this one.

I ended up breaking the query into CTEs so I could use correlated subqueries to get the specific data I needed.

WITH previousMonths AS (
    SELECT *
    FROM (
        SELECT FirstDayOfMonth, FirstDayOfNextMonth, account_c,
          FIRST_VALUE(acl.timestamp_c ) OVER (start_values) timestamp_c,
          FIRST_VALUE(acl.acv_current_c ) OVER (start_values) acv_current_c,
          FIRST_VALUE(acl.license_count_current_c) OVER(start_values) license_count_current_c,
          FIRST_VALUE(acl.price_current_c) OVER (start_values) price_current_c
        FROM warehouse.project.calendar_months cm
        JOIN warehouse.project.account_change_logs acl ON timestamp_c < FirstDayOfMonth
        WINDOW start_values AS (PARTITION BY account_c, FirstDayOfMonth ORDER BY timestamp_c DESC)
    )
    GROUP BY FirstDayOfMonth, FirstDayOfNextMonth, account_c, 
      timestamp_c, acv_current_c, license_count_current_c, price_current_c
  ), 
  currentMonth AS (
    SELECT *
    FROM (
        SELECT FirstDayOfMonth, FirstDayOfNextMonth, account_c,
          FIRST_VALUE(acl.timestamp_c ) OVER (change_values) timestamp_c,
          FIRST_VALUE(acl.acv_current_c ) OVER (change_values) acv_current_c,
          FIRST_VALUE(acl.license_count_current_c) OVER(change_values) license_count_current_c,
          FIRST_VALUE(acl.acv_old_c) OVER(PARTITION BY account_c, FirstDayOfMonth ORDER BY timestamp_c) acv_old_at_start_of_month_c,
          FIRST_VALUE(acl.license_count_old_c) OVER(PARTITION BY account_c, FirstDayOfMonth ORDER BY timestamp_c) license_count_old_at_start_of_month_c,
          FIRST_VALUE(acl.price_current_c) OVER (change_values) price_current_c
        FROM warehouse.project.calendar_months  cm
        JOIN warehouse.project.account_change_logs acl
        ON timestamp_c >= FirstDayOfMonth AND timestamp_c < FirstDayOfNextMonth 
        WINDOW change_values AS (PARTITION BY account_c, FirstDayOfMonth ORDER BY timestamp_c DESC)
    )

    GROUP BY FirstDayOfMonth, FirstDayOfNextMonth, account_c, 
      timestamp_c, acv_current_c, acv_old_at_start_of_month_c, license_count_current_c, 
      license_count_old_at_start_of_month_c, price_current_c
)

SELECT FirstDayOfMonth, FirstDayOfNextMonth, 
  (SELECT COUNT(acv_current_c) FROM previousMonths pm WHERE pm.FirstDayOfMonth = cal.FirstDayOfMonth
    AND license_count_current_c > 0) as StartingAccounts,

  (SELECT COUNT(acv_current_c) FROM currentMonth cm WHERE cm.FirstDayOfMonth = cal.FirstDayOfMonth
    AND license_count_old_at_start_of_month_c = 0 AND license_count_current_c > 0) as NewAccounts,

  (SELECT COUNT(acv_current_c) FROM currentMonth cm WHERE cm.FirstDayOfMonth = cal.FirstDayOfMonth
    AND license_count_current_c = 0) as ChurnAccounts,

  (SELECT SUM(license_count_current_c) FROM previousMonths pm WHERE pm.FirstDayOfMonth = cal.FirstDayOfMonth
    AND license_count_current_c > 0) as StartingUsers,

  (SELECT SUM(license_count_current_c) FROM currentMonth cm WHERE cm.FirstDayOfMonth = cal.FirstDayOfMonth
    AND license_count_old_at_start_of_month_c = 0 AND license_count_current_c > 0) as NewUsers,

  (SELECT SUM(license_count_current_c - license_count_old_at_start_of_month_c) FROM currentMonth cm WHERE cm.FirstDayOfMonth = cal.FirstDayOfMonth
    AND license_count_old_at_start_of_month_c < license_count_current_c
    AND license_count_old_at_start_of_month_c <> 0) as ExpansionUsers,

  (SELECT SUM(license_count_old_at_start_of_month_c - license_count_current_c) FROM currentMonth cm WHERE cm.FirstDayOfMonth = cal.FirstDayOfMonth
    AND license_count_old_at_start_of_month_c > license_count_current_c
    AND license_count_current_c <> 0) as ContractionUsers,

  (SELECT SUM(license_count_old_at_start_of_month_c - license_count_current_c) FROM currentMonth cm WHERE cm.FirstDayOfMonth = cal.FirstDayOfMonth
    AND license_count_old_at_start_of_month_c > license_count_current_c
    AND license_count_current_c = 0) as ChurnUsers,

  (SELECT SUM(acv_current_c) FROM previousMonths pm WHERE pm.FirstDayOfMonth = cal.FirstDayOfMonth
    AND license_count_current_c > 0) as StartingARR 

    --etc, etc,


FROM warehouse.project.calendar_months cal
ORDER BY FirstDayOfMonth

Upvotes: 0

Alex227
Alex227

Reputation: 136

Correlated subquery like

SELECT TOP 1 * FROM clientlogs WHERE clientid = c.clientid AND [timestamp] < cm.FirstDayOfMonth ORDER BY [timestamp] DESC)

in BigQuery usually needs to be rewritten through aggregation along the lines of

SELECT ARRAY_AGG(foo ORDER BY [timestamp] DESC LIMIT 1)[offset(0)] FROM ... as foo WHERE correlated condition

BigQuery more likely to work with simple correlated subqueries in the form of

SELECT {optional aggregation} FROM table WHERE {correlated condition}

Upvotes: 0

Mikhail Berlyant
Mikhail Berlyant

Reputation: 172974

Below is for BigQuery Standard SQL

#standardSQL
SELECT FirstDayOfMonth, FirstDayOfNextMonth, 
  SUM(price_current) StartingRevenue, COUNT(1) StartingClientCount 
FROM (
  SELECT FirstDayOfMonth, FirstDayOfNextMonth, 
    clientid, price_current
  FROM (
    SELECT FirstDayOfMonth, FirstDayOfNextMonth, clientid,
      FIRST_VALUE(price_current) OVER(latest_values) price_current,
      FIRST_VALUE(license_count_current) OVER(latest_values) license_count_current
    FROM `project.dataset.CalendarMonths` cm
    JOIN `project.dataset.ClientLogs` cl
    ON `timestamp` < FirstDayOfMonth 
    WINDOW latest_values AS (PARTITION BY clientid ORDER BY `timestamp` DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
  )
  WHERE license_count_current > 0
  GROUP BY FirstDayOfMonth, FirstDayOfNextMonth, clientid, price_current
)
GROUP BY FirstDayOfMonth, FirstDayOfNextMonth
ORDER BY FirstDayOfMonth  

most likely above can be extended to the rest of your subqueries

Upvotes: 3

Related Questions