Joining two datasets with subqueries

Question

I am attempting to join two large datasets using BigQuery. They have a common field, but it has a different name in each dataset.

I want to count the number of rows and sum the results of my CASE logic for both table1 and table2.

I believe that I have errors resulting from the subqueries (subselects?) as well as syntax errors. I have tried to apply precedent from similar posts, but I still seem to be missing something. Any assistance in getting this sorted is greatly appreciated.

SELECT
table1.field1,
table1.field2,
    (
    SELECT COUNT (*)
    FROM table1) AS table1_total,
sum(case when table1.mutually_exclusive_metric1 = "Y" then 1 else 0 end) AS t1_pass_1,
sum(case when table1.mutually_exclusive_metric1 = "Y" AND table1.mutually_exclusive_metric2 IS null OR table1.mutually_exclusive_metric3 = 'Y' then 1 else 0 end) AS t1_pass_2, 
sum(case when table1.mutually_exclusive_metric3 ="Y" AND table1.mutually_exclusive_metric2 ="Y" AND table1.mutually_exclusive_metric3 ="Y" then 1 else 0 end) AS  t1_pass_3,
    (
    SELECT COUNT (*)
    FROM table2) AS table2_total,
sum(case when table2.metric1 IS true then 1 else 0 end) AS t2_pass_1,
sum(case when table2.metric2 IS true then 1 else 0 end) AS t2_pass_2,
    (
        SELECT COUNT (*)
        FROM dataset1.table1 JOIN EACH dataset2.table2 ON common_field_table1 =  common_field_table2) AS overlap 
FROM
dataset1.table1,
dataset2.table2
WHERE
XYZ

Thanks in advance!

Roman · Accepted Answer

Sho. Let's take this one step at a time:
1) Using * is not explicit, and being explicit is good. Additionally, combining explicitly selected fields with * duplicates those fields in the output under auto-generated names: table1.field becomes table1_field. Unless you are just playing around, don't use *.
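
A small sketch of the duplication, using Python's built-in sqlite3 module as a stand-in for BigQuery (an assumption made only so the example runs anywhere). SQLite keeps the repeated field under its original name instead of renaming it to something like table1_field1, but the duplicated output column is the same problem. The table contents and the column_names helper are made up for illustration.

```python
import sqlite3

# In-memory SQLite database standing in for BigQuery (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (field1 TEXT, field2 TEXT)")
conn.execute("INSERT INTO table1 VALUES ('a', 'x')")


def column_names(sql):
    """Hypothetical helper: return the output column names of a query."""
    cursor = conn.execute(sql)
    return [col[0] for col in cursor.description]


# Mixing an explicit field with * returns field1 twice.
mixed = column_names("SELECT table1.field1, * FROM table1")

# Listing the fields explicitly returns each one once, under the name you chose.
explicit = column_names(
    "SELECT table1.field1 AS field1, table1.field2 AS field2 FROM table1"
)

print(mixed)
print(explicit)
```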

2) You never joined. A query with a join looks like this (note the order of the WHERE and GROUP BY clauses, and the alias given to each table and field):

SELECT
  t1.field1 AS field1,
  t2.field2 AS field2
FROM dataset1.table1 AS t1

JOIN dataset2.table2 AS t2
ON t1.field1 = t2.field1

WHERE t1.field1 = "some value"

GROUP BY field1, field2

Here t1.field1 and t2.field1 are the fields that contain corresponding values. Because the join makes them equal, you wouldn't repeat both in the SELECT.
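
Since the question mentions that the common field has a different name in each table, here is a minimal sketch of that case. It again runs on Python's sqlite3 as a stand-in for BigQuery (an assumption), and the sample rows and the metric1 columns are invented; only common_field_table1 and common_field_table2 come from the question.

```python
import sqlite3

# In-memory SQLite database standing in for BigQuery (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (common_field_table1 INTEGER, metric1 TEXT);
    CREATE TABLE table2 (common_field_table2 INTEGER, metric1 INTEGER);
    INSERT INTO table1 VALUES (1, 'Y'), (2, 'N'), (3, 'Y');
    INSERT INTO table2 VALUES (2, 1), (3, 0), (4, 1);
""")

# The ON clause names the key field of each table explicitly,
# so the two fields do not need to share a name.
overlap_rows = conn.execute("""
    SELECT
      t1.common_field_table1 AS common_field,
      t1.metric1 AS t1_metric1,
      t2.metric1 AS t2_metric1
    FROM table1 AS t1
    JOIN table2 AS t2
      ON t1.common_field_table1 = t2.common_field_table2
    ORDER BY common_field
""").fetchall()

# Number of key values present in both tables.
overlap = len(overlap_rows)
print(overlap_rows, overlap)
```

Only the keys that appear in both tables survive the join, which is what the overlap count in the question was after.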

3) Use whitespace to make your code easier to read. It helps everyone involved, including you.

4) Your subselects are pretty useless. A subselect is used instead of creating a new table: you would use one to group or filter data from an existing table before selecting from the result. For example:

SELECT
  subselect.field1 AS ssf1,
  subselect.max_f2 AS ss_max_f2
FROM (
    -- Runs logically first: one row per field1 with its largest field2.
    SELECT
        t1.field1 AS field1,
        MAX(t1.field2) AS max_f2
    FROM dataset1.table1 AS t1
    GROUP BY field1
) AS subselect

The subselect is practically a new table that you select from. Treat it logically as if it runs first; the main select then uses its results.
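
Applied to the totals in the question, one possible arrangement is to summarise each table in its own subselect and then combine the two single-row results. This is only a sketch under assumptions: it runs on Python's sqlite3 rather than BigQuery, the sample data is invented, and the CASE conditions are simplified versions of the ones in the question.

```python
import sqlite3

# In-memory SQLite database standing in for BigQuery (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (common_field_table1 INTEGER, metric1 TEXT);
    CREATE TABLE table2 (common_field_table2 INTEGER, metric1 INTEGER);
    INSERT INTO table1 VALUES (1, 'Y'), (2, 'N'), (3, 'Y');
    INSERT INTO table2 VALUES (2, 1), (3, 0), (4, 1), (5, 1);
""")

# Each subselect is evaluated as if it were its own one-row table;
# the outer SELECT just places the two summaries side by side.
row = conn.execute("""
    SELECT
      s1.table1_total,
      s1.t1_pass_1,
      s2.table2_total,
      s2.t2_pass_1
    FROM (
        SELECT
          COUNT(*) AS table1_total,
          SUM(CASE WHEN metric1 = 'Y' THEN 1 ELSE 0 END) AS t1_pass_1
        FROM table1
    ) AS s1
    CROSS JOIN (
        SELECT
          COUNT(*) AS table2_total,
          SUM(CASE WHEN metric1 = 1 THEN 1 ELSE 0 END) AS t2_pass_1
        FROM table2
    ) AS s2
""").fetchone()

print(row)
```

Because each summary is computed before the tables are combined, the counts are not inflated by rows being multiplied against each other, which is what happens when aggregates are taken over two tables listed together in one FROM.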

5) This was a terrible question. It didn't even look like you tried to figure things out one step at a time.
