corgrath
corgrath

Reputation: 12285

How to combine multiple SELECTs into a single SELECT by a common column in (BigQuery) SQL?

Given I have multiple tables in BigQuery, hence I have multiple SQL-statements that gives me "the number of X per day". For example:

SELECT FORMAT_TIMESTAMP("%F",timestamp) AS day, COUNT(*) as installs
FROM database.table1
GROUP BY day
ORDER BY day ASC

Which would give the result:

| day        | installs |
-------------------------
| 2017-01-01 | 11       |
| 2017-01-02 | 22       |
etc

Another statement:

SELECT FORMAT_TIMESTAMP("%F",timestamp) AS day, COUNT(*) as uninstalls
FROM database.table2
GROUP BY day
ORDER BY day ASC

Which would give the result:

| day        | uninstalls |
---------------------------
| 2017-01-02 | 22         |
| 2017-01-03 | 33         |
etc

Another statement:

SELECT FORMAT_TIMESTAMP("%F",timestamp) AS day, COUNT(*) as cases
FROM database.table3
GROUP BY day
ORDER BY day ASC

Which would give the result:

| day        | cases |
----------------------
| 2017-01-01 | 11    |
| 2017-01-03 | 33    |
etc

etc

Now I need to combine all these into a single SELECT statement that gives the following results:

| day        | installs | uninstalls | cases |
----------------------------------------------
| 2017-01-01 | 11       | 0          | 11    |
| 2017-01-02 | 22       | 22         | 0     |
| 2017-01-03 | 0        | 33         | 33    |
etc

Is this even possible?

Or what's the closest SQL-statement I can write that would give me a similar result?

Any feedback is appreciated!

Upvotes: 0

Views: 151

Answers (3)

Mikhail Berlyant
Mikhail Berlyant

Reputation: 173028

Below is for BigQuery Standard SQL

#standardSQL
WITH calendar AS (
  SELECT day
  FROM (
    SELECT MIN(min_day) AS min_day, MAX(max_day) AS max_day
    FROM (
      SELECT MIN(DATE(timestamp)) AS min_day, MAX(DATE(timestamp)) AS max_day FROM `database.table1` UNION ALL
      SELECT MIN(DATE(timestamp)) AS min_day, MAX(DATE(timestamp)) AS max_day FROM `database.table2` UNION ALL
      SELECT MIN(DATE(timestamp)) AS min_day, MAX(DATE(timestamp)) AS max_day FROM `database.table3`
    )
  ), UNNEST(GENERATE_DATE_ARRAY(min_day, max_day, INTERVAL 1 DAY)) AS day
)
SELECT 
  c.day AS day, 
  IFNULL(SUM(installs), 0) AS installs,
  IFNULL(SUM(uninstalls), 0) AS uninstalls,
  IFNULL(SUM(cases),0) AS cases  
FROM calendar AS c
LEFT JOIN (SELECT DATE(timestamp) day, COUNT(1) installs   FROM `database.table1` GROUP BY day) t1 ON t1.day = c.day
LEFT JOIN (SELECT DATE(timestamp) day, COUNT(1) uninstalls FROM `database.table2` GROUP BY day) t2 ON t2.day = c.day
LEFT JOIN (SELECT DATE(timestamp) day, COUNT(1) cases      FROM `database.table3` GROUP BY day) t3 ON t3.day = c.day
GROUP BY day
HAVING installs + uninstalls + cases > 0
-- ORDER BY day  

Please note: you are using timestamp as a column name which is not the best practice as it is keyword, so in my example i leave your naming but consider to change this!

You can test / play this solution with below dummy data

#standardSQL
WITH `database.table1` AS (
  SELECT TIMESTAMP '2017-01-01' AS timestamp, 1 AS installs  
  UNION ALL  SELECT TIMESTAMP '2017-01-01', 22 
),
`database.table2` AS (
  SELECT TIMESTAMP '2016-12-01' AS timestamp, 1 AS installs  UNION ALL  SELECT TIMESTAMP '2017-01-01', 22 UNION ALL  SELECT TIMESTAMP '2017-01-01', 22 UNION ALL
  SELECT TIMESTAMP '2017-01-02', 22 UNION ALL  SELECT TIMESTAMP '2017-01-02', 22 UNION ALL  SELECT TIMESTAMP '2017-01-02', 22 UNION ALL  SELECT TIMESTAMP '2017-01-02', 22 UNION ALL  SELECT TIMESTAMP '2017-01-02', 22 
),
`database.table3` AS (
  SELECT TIMESTAMP '2017-01-01' AS timestamp, 1 AS installs  UNION ALL  SELECT TIMESTAMP '2017-01-01', 22 UNION ALL  SELECT TIMESTAMP '2017-01-01', 22 UNION ALL
  SELECT TIMESTAMP '2017-01-10', 22 UNION ALL  SELECT TIMESTAMP '2017-01-02', 22 UNION ALL  SELECT TIMESTAMP '2017-01-02', 22 UNION ALL  SELECT TIMESTAMP '2017-01-02', 22 UNION ALL  SELECT TIMESTAMP '2017-01-02', 22 
),
calendar AS (
  SELECT day
  FROM (
    SELECT MIN(min_day) AS min_day, MAX(max_day) AS max_day
    FROM (
      SELECT MIN(DATE(timestamp)) AS min_day, MAX(DATE(timestamp)) AS max_day FROM `database.table1` UNION ALL
      SELECT MIN(DATE(timestamp)) AS min_day, MAX(DATE(timestamp)) AS max_day FROM `database.table2` UNION ALL
      SELECT MIN(DATE(timestamp)) AS min_day, MAX(DATE(timestamp)) AS max_day FROM `database.table3`
    )
  ), UNNEST(GENERATE_DATE_ARRAY(min_day, max_day, INTERVAL 1 DAY)) AS day
)
SELECT 
  c.day AS day, 
  IFNULL(SUM(installs), 0) AS installs,
  IFNULL(SUM(uninstalls), 0) AS uninstalls,
  IFNULL(SUM(cases),0) AS cases  
FROM calendar AS c
LEFT JOIN (SELECT DATE(timestamp) day, COUNT(1) installs   FROM `database.table1` GROUP BY day) t1 ON t1.day = c.day
LEFT JOIN (SELECT DATE(timestamp) day, COUNT(1) uninstalls FROM `database.table2` GROUP BY day) t2 ON t2.day = c.day
LEFT JOIN (SELECT DATE(timestamp) day, COUNT(1) cases      FROM `database.table3` GROUP BY day) t3 ON t3.day = c.day
GROUP BY day
HAVING installs + uninstalls + cases > 0
ORDER BY day

Upvotes: 1

Elliott Brossard
Elliott Brossard

Reputation: 33745

Here is a self-contained example that might help to get you started. It uses two dummy tables, InstallEvents and UninstallEvents, which contain timestamps for the respective actions. It creates a common table expression called StartAndEnd that computes the minimum and maximum dates for these events in order to decide which dates to aggregate over, then unions the contents of the InstallEvents and UninstallEvents, counting the events for each day.

WITH InstallEvents AS (
  SELECT TIMESTAMP_ADD('2017-01-01 00:00:00', INTERVAL x HOUR) AS timestamp
  FROM UNNEST(GENERATE_ARRAY(0, 100)) AS x
),
UninstallEvents AS (
  SELECT TIMESTAMP_ADD('2017-01-02 00:00:00', INTERVAL 2 * x HOUR) AS timestamp
  FROM UNNEST(GENERATE_ARRAY(0, 50)) AS x
),
StartAndEnd AS (
  SELECT MIN(DATE(timestamp)) AS min_date, MAX(DATE(timestamp)) AS max_date
  FROM (
    SELECT * FROM InstallEvents UNION ALL
    SELECT * FROM UninstallEvents
  )
)
SELECT
  day,
  COUNTIF(is_install AND DATE(timestamp) = day) AS installs,
  COUNTIF(NOT is_install AND DATE(timestamp) = day) AS uninstalls
FROM (
  SELECT *, true AS is_install
  FROM InstallEvents UNION ALL
  SELECT *, false
  FROM UninstallEvents
)
CROSS JOIN UNNEST(GENERATE_DATE_ARRAY(
    (SELECT min_date FROM StartAndEnd),
    (SELECT max_date FROM StartAndEnd)
  )) AS day
GROUP BY day
ORDER BY day;

If you know what the start and end dates are in advance, you can hard-code them in the query instead and then omit the StartAndEnd CTE:

WITH InstallEvents AS (
  SELECT TIMESTAMP_ADD('2017-01-01 00:00:00', INTERVAL x HOUR) AS timestamp
  FROM UNNEST(GENERATE_ARRAY(0, 100)) AS x
),
UninstallEvents AS (
  SELECT TIMESTAMP_ADD('2017-01-02 00:00:00', INTERVAL 2 * x HOUR) AS timestamp
  FROM UNNEST(GENERATE_ARRAY(0, 50)) AS x
)
SELECT
  day,
  COUNTIF(is_install AND DATE(timestamp) = day) AS installs,
  COUNTIF(NOT is_install AND DATE(timestamp) = day) AS uninstalls
FROM (
  SELECT *, true AS is_install
  FROM InstallEvents UNION ALL
  SELECT *, false
  FROM UninstallEvents
)
CROSS JOIN UNNEST(GENERATE_DATE_ARRAY('2017-01-01', '2017-01-04')) AS day
GROUP BY day
ORDER BY day;

To see the events in the sample data, use a query that unions the contents:

WITH InstallEvents AS (
  SELECT TIMESTAMP_ADD('2017-01-01 00:00:00', INTERVAL x HOUR) AS timestamp
  FROM UNNEST(GENERATE_ARRAY(0, 100)) AS x
),
UninstallEvents AS (
  SELECT TIMESTAMP_ADD('2017-01-02 00:00:00', INTERVAL 2 * x HOUR) AS timestamp
  FROM UNNEST(GENERATE_ARRAY(0, 50)) AS x
)
SELECT timestamp, true AS is_install
FROM InstallEvents UNION ALL
SELECT timestamp, false
FROM UninstallEvents;

Upvotes: 2

HoneyBadger
HoneyBadger

Reputation: 15150

I am not very familiar with bigquery, so this is probably not going to be a copy-paste answer.

You'll first have to build a calander table to make sure you have all dates. Here's an example for sql server. There are probably examples for bigquery available as well. The following assumes a Calander table with Date attribute in timestamp.

Once you have your calander table you can join all your tables to that:

SELECT      FORMAT_TIMESTAMP("%F",C.Date) AS day
,           COUNT(T1.DATE(T1.TIMESTAMP)) AS installs --Here you could also use your FORMAT_TIMESTAMP
,           COUNT(T1.DATE(T2.TIMESTAMP)) AS uninstalls
FROM        Calander C
LEFT JOIN   database.table1 T1
        ON  DATE(T1.TIMESTAMP) = DATE(C.Date) --Convert to date to remove times, you could also use your FORMAT_TIMESTAMP
LEFT JOIN   database.table2 T2
        ON  DATE(T2.TIMESTAMP) = DATE(C.Date)
GROUP BY    day
ORDER BY    day ASC

Upvotes: 0

Related Questions